Created attachment 1105367 [details]
deployment error

Description of problem:

7.0 -> 7.2 update fails due to the openstack-keystone resource not being stopped. The update fails during ControllerPostPuppetRestartDeployment while waiting for the openstack-keystone resource to stop.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.0 by using the 7.0 templates:

openstack overcloud deploy \
  --templates ~/templates/my-overcloud \
  --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu \
  -e ~/templates/my-overcloud/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/firstboot-environment.yaml \
  -e ~/templates/ceph.yaml

2. Update to 7.2 by following the update procedure:

/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
  --templates ~/templates/my-overcloud \
  -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
  -e ~/templates/my-overcloud/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/firstboot-environment.yaml \
  -e ~/templates/ceph.yaml \
  -e ~/templates/my-overcloud/environments/updates/update-from-keystone-admin-internal-api.yaml

Actual results:
overcloud | UPDATE_FAILED

Expected results:
Update completes OK.

Additional info:
Attaching the deployment error output.
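For reference, a minimal sketch of how the failed nested resources can be listed after such a failure, using the Kilo-era heat CLI (the nesting depth of 5 is an arbitrary choice):

heat resource-list -n 5 overcloud | grep -i failed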
This is another instance of a service failing to stop in a timely manner. Bumping the timeout wouldn't help this time; nova-scheduler really failed to stop.

Failed resources in overcloud stack:

| ControllerNodesPostDeployment         | f4a097f1-3e93-4649-b2a5-06a325a711bb | OS::TripleO::ControllerPostDeployment    | UPDATE_FAILED | 2015-12-13T19:35:26Z |                               |
| ControllerPostPuppet                  | 86797237-96f3-43e4-b14e-683f0f3185cd | OS::TripleO::Tasks::ControllerPostPuppet | CREATE_FAILED | 2015-12-13T19:53:53Z | ControllerNodesPostDeployment |
| ControllerPostPuppetRestartDeployment | 4322a876-a26c-44e5-b653-b928ebee12d2 | OS::Heat::SoftwareDeployments            | CREATE_FAILED | 2015-12-13T19:53:54Z | ControllerPostPuppet          |
| 0                                     | 35e75221-55ce-4c0d-b9f4-efb7ba5b1d66 | OS::Heat::SoftwareDeployment             | CREATE_FAILED | 2015-12-13T19:54:51Z | ControllerPostPuppet          |

From os-collect-config log (timestamps use a Czech locale; "pro" = December):

pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone never stopped after 1200 seconds

Failures from pcs status:

* openstack-nova-scheduler_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=413, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99987ms
* neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=382, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=37ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=379, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=30ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=401, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99988ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=364, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=16ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-1 'OCF_TIMEOUT' (198): call=667, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:59:38 2015', queued=13ms, exec=1ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=633, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=18ms

Nova-scheduler journal:

pro 13 13:04:33 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 13:04:35 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: /usr/lib64/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "instances.host" was invoked
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: 'strategies for improved performance.' % expr)
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Nova Scheduler Server...
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 14:01:45 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:01:46 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:45:30 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:01 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:57:58 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.

nova-scheduler.log is too long for an inline post; will attach it.
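For anyone triaging similar stop failures, a minimal sketch of where the two deadlines above come from (resource and unit names as they appear in the logs):

pcs resource show openstack-nova-scheduler                  # pacemaker's configured stop op timeout
systemctl show openstack-nova-scheduler -p TimeoutStopUSec  # systemd's stop-sigterm deadline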
Juan or Adam, can we get some help understanding why the keystone services may take a long time to shut down and whether it is a
Sorry about that. Intended for Eoghan rather than Juan or Adam.
Created attachment 1105542 [details]
nova-scheduler.log
The pattern below suggests an oslo.messaging problem when shutting down the scheduler:

2015-12-13 13:54:37.387 17334 INFO nova.openstack.common.service [req-28775323-1249-4714-b569-c1277e5e2458 - - - - -] Caught SIGTERM, exiting
2015-12-13 13:54:37.772 17334 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:

Looking at how the oslo-incubator service code (in OSP7) asks oslo.messaging to stop, it seems that when there is a problem, it doesn't stop the service.

Could you please provide a Guru Meditation Report for the nova-scheduler service by calling:

kill -SIGUSR <nova_scheduler_pid>

https://www.berrange.com/posts/2015/02/19/nova-and-its-use-of-olso-incubator-guru-meditation-reports/

That way I could see which threads are locking.
My bad, I meant: kill -USR1 <pid>
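A minimal sketch of the full sequence on a controller (the pgrep lookup is illustrative):

kill -USR1 $(pgrep -f nova-scheduler)            # triggers the Guru Meditation Report
journalctl -u openstack-nova-scheduler -n 200    # the report is dumped to the service's stderr, so it lands in the journal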
At this point there is no nova-scheduler process running on the controller nodes. Please let me know if there are other logs I can provide that might be useful.
Created attachment 1107197 [details]
'rabbitmqctl report' incomplete output: it fails to dump the list of channels

Right now, it looks like rabbitmq is unable to list the channels. It's possible to *open* channels, but not to *close* channels.

"rabbitmqctl list_channels" hangs before providing any output.
"rabbitmqctl report" hangs just after writing "Channels:".
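To reproduce the hang with a bounded wait instead of blocking a terminal, a minimal sketch (timeout is GNU coreutils; the 30-second limit is an arbitrary choice):

timeout 30 rabbitmqctl list_channels || echo "list_channels hung"
timeout 30 rabbitmqctl report > /tmp/rabbitmq-report.txt || echo "report hung"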
I used tcpdump to see what happens. nova-scheduler is able to open a channel in RabbitMQ (it sends Channel.Open and gets Channel.Open-ok back), but it's unable to close a channel: it sends Channel.Close and RabbitMQ never replies. So the bug is not in nova-scheduler nor in Oslo Messaging, but in RabbitMQ itself.

According to Petr, it might be the bug fixed by https://github.com/rabbitmq/rabbitmq-server/commit/7dbfe9d, "which means your networking was partitioned".

Reassigning to Petr with the component RabbitMQ.
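For reference, a minimal sketch of the capture used for this kind of analysis (5672 is the default AMQP port; the interface and output path are illustrative):

tcpdump -i any -s 0 -w /tmp/amqp.pcap 'tcp port 5672'
# then look for the Channel.Open/Channel.Close method frames, e.g. with:
# tshark -r /tmp/amqp.pcap -Y amqp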
Status report. This ticket is a duplicate of bug 1337704. The solution to that bug is too complex to backport to 3.3.5. However, we did backport a few patches that add timeouts to various rabbitmqctl commands as part of the rabbitmq-server-3.3.5-25.el7ost build. Please consider upgrading to that version. The timeouts mentioned in comment 16 will be gone as soon as you upgrade to that build.
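A minimal sketch of pulling in the fixed build on each controller (assuming the OSP repositories already carry it):

yum upgrade rabbitmq-server-3.3.5-25.el7ost
rpm -q rabbitmq-server    # confirm the installed version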
Updates now go to the latest release (at the moment that's 7.3), and nobody should update to 7.2, so closing this BZ.
Closed, no need for needinfo.