Bug 1291096 - 7.0 -> 7.2 update fails due to openstack-keystone resource not being stopped during pacemaker_resource_restart.sh
Summary: 7.0 -> 7.2 update fails due to openstack-keystone resource not being stopped during pacemaker_resource_restart.sh
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: async
Target Release: 7.0 (Kilo)
Assignee: Peter Lemenkov
QA Contact: Alexander Chuzhoy
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-12-13 20:38 UTC by Marius Cornea
Modified: 2019-09-09 15:17 UTC
CC List: 24 users

Fixed In Version: rabbitmq-server-3.3.5-25.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-05 19:14:31 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
deployment error (294.70 KB, text/plain)
2015-12-13 20:38 UTC, Marius Cornea
nova-scheduler.log (436.07 KB, text/plain)
2015-12-14 10:49 UTC, Jiri Stransky
'rabbitmqctl report' incomplete output: it fails to dump the list of channels (18.25 KB, text/plain)
2015-12-18 15:31 UTC, Victor Stinner

Description Marius Cornea 2015-12-13 20:38:23 UTC
Created attachment 1105367 [details]
deployment error

Description of problem:
The 7.0 -> 7.2 update fails because the openstack-keystone resource does not get stopped. The update fails during ControllerPostPuppetRestartDeployment while waiting for the openstack-keystone resource to stop.
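(For context, that step is essentially a polling loop on the controller. Below is a minimal sketch of such a loop, not the shipped pacemaker_resource_restart.sh; the resource name, the 3-second interval and the 1200-second budget are taken from the os-collect-config log quoted in comment 2, and the pcs check is simplified:)

#!/bin/bash
# Illustrative sketch only; the real script's state check differs.
RESOURCE=openstack-keystone
TIMEOUT=1200
WAITED=0
while pcs status | grep "$RESOURCE" | grep -q Started; do
    if [ "$WAITED" -ge "$TIMEOUT" ]; then
        echo "$RESOURCE never stopped after $TIMEOUT seconds"
        exit 1
    fi
    echo "$RESOURCE not yet stopped, sleeping 3 seconds."
    sleep 3
    WAITED=$((WAITED + 3))
done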

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.0 by using the 7.0 templates:
openstack overcloud deploy \
    --templates ~/templates/my-overcloud \
    --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
    --ntp-server clock.redhat.com \
    --libvirt-type qemu \
    -e ~/templates/my-overcloud/environments/network-isolation.yaml \
    -e ~/templates/network-environment.yaml \
    -e ~/templates/firstboot-environment.yaml \
    -e ~/templates/ceph.yaml 

2. Update to 7.2 by following the update procedure:
/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
         --templates ~/templates/my-overcloud \
         -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
         -e ~/templates/my-overcloud/environments/network-isolation.yaml \
         -e ~/templates/network-environment.yaml \
         -e ~/templates/firstboot-environment.yaml \
         -e ~/templates/ceph.yaml \
         -e ~/templates/my-overcloud/environments/updates/update-from-keystone-admin-internal-api.yaml

Actual results:
overcloud  | UPDATE_FAILED 

Expected results:
Update completes ok.

Additional info:
Attaching the deployment error output.

Comment 2 Jiri Stransky 2015-12-14 10:10:19 UTC
This is another case of services not being able to stop in a timely manner for some reason. Bumping the timeout wouldn't help this time; nova-scheduler really failed to stop.

Failed resources in overcloud stack:

| ControllerNodesPostDeployment                 | f4a097f1-3e93-4649-b2a5-06a325a711bb          | OS::TripleO::ControllerPostDeployment             | UPDATE_FAILED   | 2015-12-13T19:35:26Z |                                               |
| ControllerPostPuppet                          | 86797237-96f3-43e4-b14e-683f0f3185cd          | OS::TripleO::Tasks::ControllerPostPuppet          | CREATE_FAILED   | 2015-12-13T19:53:53Z | ControllerNodesPostDeployment                 |
| ControllerPostPuppetRestartDeployment         | 4322a876-a26c-44e5-b653-b928ebee12d2          | OS::Heat::SoftwareDeployments                     | CREATE_FAILED   | 2015-12-13T19:53:54Z | ControllerPostPuppet                          |
| 0                                             | 35e75221-55ce-4c0d-b9f4-efb7ba5b1d66          | OS::Heat::SoftwareDeployment                      | CREATE_FAILED   | 2015-12-13T19:54:51Z | ControllerPostPuppet

From os-collect-config log:

pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone never stopped after 1200 seconds

Failures from pcs status:

* openstack-nova-scheduler_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=413, status=Timed Out, exitreason='none',
    last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99987ms
* neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=382, status=complete, exitreason='none',
    last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=37ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=379, status=complete, exitreason='none',
    last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=30ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=401, status=Timed Out, exitreason='none',
    last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99988ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=364, status=complete, exitreason='none',
    last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=16ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-1 'OCF_TIMEOUT' (198): call=667, status=Timed Out, exitreason='none',
    last-rc-change='Sun Dec 13 14:59:38 2015', queued=13ms, exec=1ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=633, status=complete, exitreason='none',
    last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=18ms
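(For reference only: failed actions like the ones above are usually inspected, and once the underlying problem is resolved cleared, with pcs. A generic sketch, not steps that were run on this deployment:)

pcs status --full                              # list resources plus the recorded failed actions
pcs resource cleanup openstack-nova-scheduler  # clear the failure history once the stop issue is fixed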

Nova-scheduler journal:

pro 13 13:04:33 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 13:04:35 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: /usr/lib64/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "instances.host" was invoked 
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: 'strategies for improved performance.' % expr)
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Nova Scheduler Server...
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 14:01:45 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:01:46 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:45:30 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:01 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:57:58 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.
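(The "stop-sigterm timed out. Killing." lines are systemd giving up on a clean SIGTERM shutdown and escalating to SIGKILL. For reference, a generic way to check the effective stop timeout and to follow the unit while Pacemaker restarts it:)

systemctl show openstack-nova-scheduler -p TimeoutStopUSec   # how long systemd waits before sending SIGKILL
journalctl -u openstack-nova-scheduler -f                    # follow the unit during the stop/start cycle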

nova-scheduler.log is too long for an inline post; I'll attach it.

Comment 3 Mike Orazi 2015-12-14 10:44:59 UTC
Juan or Adam,

Can we get some help understanding why the keystone services may take a long time to shut down and whether it is a

Comment 4 Mike Orazi 2015-12-14 10:45:57 UTC
Sorry about that.  Intended for Eoghan rather than Juan or Adam.

Comment 5 Jiri Stransky 2015-12-14 10:49:38 UTC
Created attachment 1105542 [details]
nova-scheduler.log

Comment 6 Sylvain Bauza 2015-12-14 13:45:35 UTC
The pattern below points to an oslo.messaging problem when shutting down the scheduler:

2015-12-13 13:54:37.387 17334 INFO nova.openstack.common.service [req-28775323-1249-4714-b569-c1277e5e2458 - - - - -] Caught SIGTERM, exiting
2015-12-13 13:54:37.772 17334 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue: 

Looking at how the oslo-incubator service code (in OSP7) asks oslo.messaging to stop, it seems that when something goes wrong there, the service never stops.

Could you please provide a Guru Meditation Report for the nova-scheduler service by calling:

kill -SIGUSR <nova_scheduler_pid>

https://www.berrange.com/posts/2015/02/19/nova-and-its-use-of-olso-incubator-guru-meditation-reports/

That would let me see which threads are blocking.

Comment 7 Sylvain Bauza 2015-12-14 13:46:32 UTC
My bad, I meant:

kill -USR1 <pid>
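(Putting comments 6 and 7 together, a typical invocation would look like the sketch below; the pgrep pattern and the log location are examples, not verified on this deployment:)

# Ask nova-scheduler for a Guru Meditation Report via SIGUSR1
kill -USR1 $(pgrep -f nova-scheduler)
# The report typically goes to the service's stderr, which lands in its journal/log
journalctl -u openstack-nova-scheduler --since "10 minutes ago"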

Comment 8 Marius Cornea 2015-12-14 14:50:35 UTC
At this point there is no nova-scheduler process running on the controller nodes. Please let me know if I can provide any other logs that might be useful.

Comment 16 Victor Stinner 2015-12-18 15:31:53 UTC
Created attachment 1107197 [details]
'rabbitmqctl report' incomplete output: it fails to dump the list of channels

Right now, it looks like RabbitMQ is unable to list the channels. It's possible to *open* channels, but not to *close* them.

"rabbitmqctl list_channels" hangs before providing any output.

"rabbitmqctl report" hangs just after writing "Channels:".

Comment 19 Victor Stinner 2015-12-18 15:58:02 UTC
I used tcpdump to see what happens. nova-scheduler is able to open a channel in RabbitMQ (it sends Channel.Open and got the reply Channel.Open-ok), but it's unable to close a channel: it sends Channel.Close but RabbitMQ doesn't reply.
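(For reference, a capture along these lines is enough to see the Channel.Open / Channel.Close exchange when opened in Wireshark; 5672 is the default AMQP port:)

tcpdump -i any -w /tmp/amqp.pcap 'tcp port 5672'   # capture AMQP traffic for offline inspection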

So the bug is neither in nova-scheduler nor in Oslo Messaging, but in RabbitMQ itself.

According to Petr, it might be the bug https://github.com/rabbitmq/rabbitmq-server/commit/7dbfe9d "which means your networking was partitioned".

Reassign to Petr with the component RabbitMQ.

Comment 21 Peter Lemenkov 2016-11-08 13:44:32 UTC
Status report: this ticket is a duplicate of bug 1337704. The solution to that bug is too complex to backport to 3.3.5. However, we did backport a few patches that add timeouts to various rabbitmqctl commands as part of the rabbitmq-server-3.3.5-25.el7ost build. Please consider upgrading to that version. The hangs mentioned in comment 16 will be gone as soon as you upgrade to that build.
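(To check whether a node already carries the fixed build and to pull it in, something along these lines; repository setup is not covered here:)

rpm -q rabbitmq-server        # compare against rabbitmq-server-3.3.5-25.el7ost
yum update rabbitmq-server    # pull the newer build from the configured OSP repos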

Comment 24 Alexander Chuzhoy 2017-01-05 19:14:31 UTC
Updates now go to the latest version (7.3 at the moment), and nobody should update to 7.2, so closing this BZ.

Comment 27 Amit Ugol 2018-05-02 10:53:12 UTC
closed, no need for needinfo.

