Description of problem:

OSP11 -> OSP12 upgrade: the upgrade times out during major-upgrade-composable-steps-docker on a composable roles deployment because rabbitmq is not reachable.

This is a composable roles deployment consisting of:
3 x controller nodes
3 x database nodes
3 x messaging nodes
2 x networker nodes
3 x compute nodes

major-upgrade-composable-steps-docker.yaml times out because, as a result of the upgrade, there is no rabbitmq server running on the environment, and neutron-server.service gets stuck in "activating" state trying to reach one of the rabbitmq servers:

[root@controller-0 heat-admin]# systemctl status neutron-server.service
● neutron-server.service - OpenStack Neutron Server
   Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-08-15 23:08:02 UTC; 9h ago
 Main PID: 187942 (neutron-server)
   Memory: 102.5M
   CGroup: /system.slice/neutron-server.service
           └─187942 /usr/bin/python2 /usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --confi...

Aug 15 23:08:02 controller-0 systemd[1]: Starting OpenStack Neutron Server...
Aug 15 23:08:02 controller-0 neutron-server[187942]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR...enerate reports.
Hint: Some lines were ellipsized, use -l to show in full.

[root@controller-0 heat-admin]# tail -5 /var/log/neutron/server.log
2017-08-16 08:31:49.485 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-2.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:31:50.495 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:22.536 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:23.547 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-2.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:24.568 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None: error: [Errno 111] ECONNREFUSED

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170805163048.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy a custom roles OSP11 deployment with the following setup:
   3 x controller nodes
   3 x database nodes
   3 x messaging nodes
   2 x networker nodes
   3 x compute nodes
2. Upgrade to OSP12

Actual results:
major-upgrade-composable-steps-docker.yaml eventually times out, as the upgrade leaves no running rabbitmq server in the cluster.
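Note that [Errno 111] ECONNREFUSED in the log above is a plain TCP-level refusal: nothing is listening on port 5672 on any of the messaging nodes, so this is not an AMQP-protocol or credentials problem. A minimal sketch of how one could confirm that from any overcloud node (amqp_reachable is a hypothetical helper, not part of any tripleo tooling; the hostnames are the ones from the log):

```python
import socket

def amqp_reachable(host, port=5672, timeout=2.0):
    """Return True if a bare TCP connection to the AMQP port succeeds.

    ECONNREFUSED (errno 111), name-resolution failures and timeouts all
    surface as OSError here and simply yield False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hosts taken from the neutron-server log above:
for host in ("messaging-0.internalapi.localdomain",
             "messaging-1.internalapi.localdomain",
             "messaging-2.internalapi.localdomain"):
    print(host, amqp_reachable(host))
```

On the broken environment all three checks would come back False, matching oslo.messaging cycling through the three servers with its retry backoff.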
Expected results:
major-upgrade-composable-steps-docker.yaml completes successfully.

Additional info:
I noticed that the rabbitmq-clone pcs resource wasn't deleted during the upgrade:

[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-0 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Wed Aug 16 08:34:10 2017
Last change: Tue Aug 15 23:03:27 2017 by root via cibadmin on controller-0

15 nodes configured
40 resources configured (9 DISABLED)

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-0 ]

Full list of resources:

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped (disabled): [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
 ip-192.168.24.15       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.102  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.12 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.3.19 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.17 (ocf::heartbeat:IPaddr2):       Started controller-2
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started controller-0
 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:2017-08-15.1]
   rabbitmq-bundle-docker-0     (ocf::heartbeat:docker):        Started messaging-0
   rabbitmq-bundle-docker-1     (ocf::heartbeat:docker):        Started messaging-1
   rabbitmq-bundle-docker-2     (ocf::heartbeat:docker):        Started messaging-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-15.1]
   galera-bundle-0      (ocf::heartbeat:galera):        Master database-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master database-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master database-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-08-15.1]
   redis-bundle-0       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-1       (ocf::heartbeat:redis): Slave controller-2
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:2017-08-15.1]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-2
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
Looking at the sequence of events that happened on node messaging-0 while resource rabbitmq-clone was being deleted, I can see the following unexpected logs:

Aug 15 22:48:10 messaging-0 crmd[16468]: notice: Result of stop operation for rabbitmq on messaging-0: 0 (ok)
Aug 15 22:48:24 messaging-0 ansible-pacemaker_resource[175956]: Invoked with check_mode=False state=delete resource=rabbitmq timeout=300 wait_for_resource=True
Aug 15 22:48:25 messaging-0 cib[16463]: error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]: error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]: warning: Updated CIB does not validate against pacemaker-2.8 schema/dtd
Aug 15 22:48:25 messaging-0 cib[16463]: warning: Local-only Change (client:cibadmin, call: 2): 0.78.0 (Update does not conform to the configured schema)
Aug 15 22:48:25 messaging-0 cib[16463]: warning: Completed cib_delete operation for section //clone/primitive[@id="rabbitmq"]/..: Update does not conform to the configured schema (rc=-203,

So there might be several things at play here which ultimately made the resource deletion fail, and which apparently went unnoticed.
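The "IDREF attribute rsc references an unknown ID" errors suggest that removing the clone left behind constraints still referencing "rabbitmq-clone", so the resulting CIB had dangling IDREFs and failed schema validation, and the delete was rejected. A toy illustration of that failure mode (heavily simplified XML, not the real CIB schema; dangling_refs is a hypothetical helper):

```python
import xml.etree.ElementTree as ET

# A toy CIB-like document: a clone wrapping the rabbitmq primitive,
# plus an ordering constraint that references the clone by ID.
CIB = """
<cib>
  <resources>
    <clone id="rabbitmq-clone">
      <primitive id="rabbitmq"/>
    </clone>
  </resources>
  <constraints>
    <rsc_order id="order-rabbitmq" rsc="rabbitmq-clone" then="other-rsc"/>
  </constraints>
</cib>
"""

def dangling_refs(root):
    """IDs referenced via rsc= attributes that no element defines."""
    defined = {e.get("id") for e in root.iter() if e.get("id")}
    return [e.get("rsc") for e in root.iter()
            if e.get("rsc") and e.get("rsc") not in defined]

root = ET.fromstring(CIB)
assert dangling_refs(root) == []  # consistent before the delete

# Deleting the clone wrapper (as the delete of
# //clone/primitive[@id="rabbitmq"]/.. effectively did) without first
# removing the constraint leaves a dangling IDREF -- the condition the
# pacemaker schema validation above rejected.
root.find("resources").remove(root.find(".//clone"))
print(dangling_refs(root))  # the constraint still points at rabbitmq-clone
```

Pacemaker refuses the whole update in that situation, which would explain why rabbitmq-clone survived the upgrade in the pcs status output above.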
Created https://bugzilla.redhat.com/show_bug.cgi?id=1482116 to handle concurrent CIB updates with pcs.
Putting this bug to MODIFIED here because it shouldn't reoccur once https://bugzilla.redhat.com/show_bug.cgi?id=1482116 is verified.
Marius: Given Damien's comment #3 and the linked bug being VERIFIED, do you have any objections to moving this beyond MODIFIED?
(In reply to Chris Jones from comment #4)
> Marius: Given Damien's comment #3 and the linked bug being VERIFIED, do you
> have any objections to moving this beyond MODIFIED?

No objections, we can also close this as a duplicate of bug 1482116.
*** This bug has been marked as a duplicate of bug 1482116 ***