Created attachment 1105367 [details]
deployment error

Description of problem:

7.0 -> 7.2 update fails due to the openstack-keystone resource not being stopped. The update fails during ControllerPostPuppetRestartDeployment while waiting for the openstack-keystone resource to stop.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-0.8.6-92.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy 7.0 by using the 7.0 templates:

openstack overcloud deploy \
  --templates ~/templates/my-overcloud \
  --control-scale 3 --compute-scale 1 --ceph-storage-scale 3 \
  --ntp-server clock.redhat.com \
  --libvirt-type qemu \
  -e ~/templates/my-overcloud/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/firstboot-environment.yaml \
  -e ~/templates/ceph.yaml

2. Update to 7.2 by following the update procedure:

/usr/bin/yes '' | openstack overcloud update stack overcloud -i \
  --templates ~/templates/my-overcloud \
  -e ~/templates/my-overcloud/overcloud-resource-registry-puppet.yaml \
  -e ~/templates/my-overcloud/environments/network-isolation.yaml \
  -e ~/templates/network-environment.yaml \
  -e ~/templates/firstboot-environment.yaml \
  -e ~/templates/ceph.yaml \
  -e ~/templates/my-overcloud/environments/updates/update-from-keystone-admin-internal-api.yaml

Actual results:
overcloud | UPDATE_FAILED

Expected results:
Update completes OK.

Additional info:
Attaching the deployment error output.
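For reference, a minimal sketch of how the failed nested resources can be listed after such a failure, using the Kilo-era heat CLI (the nesting depth of 5 is an arbitrary choice):

heat resource-list -n 5 overcloud | grep -i failed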
This is another instance of a service failing to stop in a timely manner. Bumping the timeout wouldn't help this time; nova-scheduler really failed to stop.

Failed resources in overcloud stack:

| ControllerNodesPostDeployment         | f4a097f1-3e93-4649-b2a5-06a325a711bb | OS::TripleO::ControllerPostDeployment    | UPDATE_FAILED | 2015-12-13T19:35:26Z |                               |
| ControllerPostPuppet                  | 86797237-96f3-43e4-b14e-683f0f3185cd | OS::TripleO::Tasks::ControllerPostPuppet | CREATE_FAILED | 2015-12-13T19:53:53Z | ControllerNodesPostDeployment |
| ControllerPostPuppetRestartDeployment | 4322a876-a26c-44e5-b653-b928ebee12d2 | OS::Heat::SoftwareDeployments            | CREATE_FAILED | 2015-12-13T19:53:54Z | ControllerPostPuppet          |
| 0                                     | 35e75221-55ce-4c0d-b9f4-efb7ba5b1d66 | OS::Heat::SoftwareDeployment             | CREATE_FAILED | 2015-12-13T19:54:51Z | ControllerPostPuppet          |

From os-collect-config log (timestamps use a Czech locale; "pro" = December):

pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone not yet stopped, sleeping 3 seconds.
pro 13 15:16:32 overcloud-controller-0.localdomain os-collect-config[5273]: openstack-keystone never stopped after 1200 seconds

Failures from pcs status:

* openstack-nova-scheduler_stop_0 on overcloud-controller-0 'OCF_TIMEOUT' (198): call=413, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99987ms
* neutron-server_monitor_60000 on overcloud-controller-0 'not running' (7): call=382, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=37ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-0 'not running' (7): call=379, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=30ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-2 'OCF_TIMEOUT' (198): call=401, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:57:58 2015', queued=0ms, exec=99988ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-2 'not running' (7): call=364, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=16ms
* openstack-nova-scheduler_stop_0 on overcloud-controller-1 'OCF_TIMEOUT' (198): call=667, status=Timed Out, exitreason='none', last-rc-change='Sun Dec 13 14:59:38 2015', queued=13ms, exec=1ms
* neutron-openvswitch-agent_monitor_60000 on overcloud-controller-1 'not running' (7): call=633, status=complete, exitreason='none', last-rc-change='Sun Dec 13 14:54:39 2015', queued=0ms, exec=18ms

Nova-scheduler journal:

pro 13 13:04:33 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 13:04:35 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: /usr/lib64/python2.7/site-packages/sqlalchemy/sql/default_comparator.py:153: SAWarning: The IN-predicate on "instances.host" was invoked
pro 13 13:04:35 overcloud-controller-0.localdomain nova-scheduler[17334]: 'strategies for improved performance.' % expr)
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopping OpenStack Nova Scheduler Server...
pro 13 13:54:37 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 14:01:45 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:01:46 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:45:30 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.
pro 13 14:47:00 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-nova-scheduler...
pro 13 14:47:01 overcloud-controller-0.localdomain systemd[1]: Started Cluster Controlled openstack-nova-scheduler.
pro 13 14:57:58 overcloud-controller-0.localdomain systemd[1]: Stopping Cluster Controlled openstack-nova-scheduler...
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service stop-sigterm timed out. Killing.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service: main process exited, code=killed, status=9/KILL
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Nova Scheduler Server.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: Unit openstack-nova-scheduler.service entered failed state.
pro 13 15:00:15 overcloud-controller-0.localdomain systemd[1]: openstack-nova-scheduler.service failed.

nova-scheduler.log is too long for an inline post; will attach it.
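For anyone triaging similar stop failures, a minimal sketch of where the two deadlines above come from (resource and unit names as they appear in the logs):

pcs resource show openstack-nova-scheduler                  # pacemaker's configured stop op timeout
systemctl show openstack-nova-scheduler -p TimeoutStopUSec  # systemd's stop-sigterm deadline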
Juan or Adam, can we get some help understanding why the keystone services may take a long time to shut down and whether it is a
Sorry about that. Intended for Eoghan rather than Juan or Adam.
Created attachment 1105542 [details]
nova-scheduler.log
The pattern below suggests an oslo.messaging problem when shutting down the scheduler:

2015-12-13 13:54:37.387 17334 INFO nova.openstack.common.service [req-28775323-1249-4714-b569-c1277e5e2458 - - - - -] Caught SIGTERM, exiting
2015-12-13 13:54:37.772 17334 ERROR oslo_messaging._drivers.impl_rabbit [-] Failed to consume message from queue:

Looking at how the oslo-incubator service code (in OSP7) asks oslo.messaging to stop, it seems that when there is a problem, it doesn't stop the service.

Could you please provide a Guru Meditation Report for the nova-scheduler service by calling:

kill -SIGUSR <nova_scheduler_pid>

https://www.berrange.com/posts/2015/02/19/nova-and-its-use-of-olso-incubator-guru-meditation-reports/

That way I could see which threads are locking.
My bad, I meant: kill -USR1 <pid>
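A minimal sketch of the full sequence on a controller (the pgrep lookup is illustrative):

kill -USR1 $(pgrep -f nova-scheduler)            # triggers the Guru Meditation Report
journalctl -u openstack-nova-scheduler -n 200    # the report is dumped to the service's stderr, so it lands in the journal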
At this point there is no nova-scheduler process running on the controller nodes. Please let me know if there are other logs I can provide that might be useful.
Created attachment 1107197 [details]
'rabbitmqctl report' incomplete output: it fails to dump the list of channels

Right now, it looks like rabbitmq is unable to list the channels. It's possible to *open* channels, but not to *close* channels.

"rabbitmqctl list_channels" hangs before providing any output.
"rabbitmqctl report" hangs just after writing "Channels:".
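To reproduce the hang with a bounded wait instead of blocking a terminal, a minimal sketch (timeout is GNU coreutils; the 30-second limit is an arbitrary choice):

timeout 30 rabbitmqctl list_channels || echo "list_channels hung"
timeout 30 rabbitmqctl report > /tmp/rabbitmq-report.txt || echo "report hung"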
I used tcpdump to see what happens. nova-scheduler is able to open a channel in RabbitMQ (it sends Channel.Open and gets Channel.Open-ok back), but it's unable to close a channel: it sends Channel.Close and RabbitMQ never replies. So the bug is not in nova-scheduler nor in Oslo Messaging, but in RabbitMQ itself.

According to Petr, it might be the bug fixed by https://github.com/rabbitmq/rabbitmq-server/commit/7dbfe9d, "which means your networking was partitioned".

Reassigning to Petr with the component RabbitMQ.
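For reference, a minimal sketch of the capture used for this kind of analysis (5672 is the default AMQP port; the interface and output path are illustrative):

tcpdump -i any -s 0 -w /tmp/amqp.pcap 'tcp port 5672'
# then look for the Channel.Open/Channel.Close method frames, e.g. with:
# tshark -r /tmp/amqp.pcap -Y amqp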
Status report. This ticket is a duplicate of bug 1337704. The solution to that bug is too complex to backport to 3.3.5. However, we did backport a few patches that add timeouts to various rabbitmqctl commands as part of the rabbitmq-server-3.3.5-25.el7ost build. Please consider upgrading to that version. The timeouts mentioned in comment 16 will be gone as soon as you upgrade to that build.
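A minimal sketch of pulling in the fixed build on each controller (assuming the OSP repositories already carry it):

yum upgrade rabbitmq-server-3.3.5-25.el7ost
rpm -q rabbitmq-server    # confirm the installed version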
Updates now go to the latest release (at the moment that's 7.3), and nobody should update to 7.2, so closing this BZ.
Closed, no need for needinfo.