Bug 1481987 - OSP11 -> OSP12 upgrade: upgrade times out during major-upgrade-composable-steps-docker on composable roles deployment because rabbitmq is not reachable
Keywords:
Status: CLOSED DUPLICATE of bug 1482116
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ga
Target Release: 12.0 (Pike)
Assignee: Michele Baldessari
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On: 1482116
Blocks:
 
Reported: 2017-08-16 08:35 UTC by Marius Cornea
Modified: 2017-10-25 09:43 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-25 09:43:41 UTC
Target Upstream Version:
Embargoed:



Description Marius Cornea 2017-08-16 08:35:06 UTC
Description of problem:
OSP11 -> OSP12 upgrade: upgrade times out during major-upgrade-composable-steps-docker on composable roles deployment because rabbitmq is not reachable.

This is a composable roles deployment consisting of:
3 x controller nodes
3 x database nodes
3 x messaging nodes
2 x networker nodes
3 x compute nodes

major-upgrade-composable-steps-docker.yaml eventually times out because, as a result of the upgrade, there is no rabbitmq server running in the environment, and neutron-server.service gets stuck in activating state while trying to reach one of the rabbitmq servers:

[root@controller-0 heat-admin]# systemctl status neutron-server.service
● neutron-server.service - OpenStack Neutron Server
   Loaded: loaded (/usr/lib/systemd/system/neutron-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Tue 2017-08-15 23:08:02 UTC; 9h ago
 Main PID: 187942 (neutron-server)
   Memory: 102.5M
   CGroup: /system.slice/neutron-server.service
           └─187942 /usr/bin/python2 /usr/bin/neutron-server --config-file /usr/share/neutron/neutron-dist.conf --config-dir /usr/share/neutron/server --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini --confi...

Aug 15 23:08:02 controller-0 systemd[1]: Starting OpenStack Neutron Server...
Aug 15 23:08:02 controller-0 neutron-server[187942]: Guru meditation now registers SIGUSR1 and SIGUSR2 by default for backward compatibility. SIGUSR1 will no longer be registered in a future release, so please use SIGUSR...enerate reports.
Hint: Some lines were ellipsized, use -l to show in full.
[root@controller-0 heat-admin]# tail -5 /var/log/neutron/server.log 
2017-08-16 08:31:49.485 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-2.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:31:50.495 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:22.536 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-1.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:23.547 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-2.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
2017-08-16 08:32:24.568 187942 ERROR oslo.messaging._drivers.impl_rabbit [req-4f9cae4a-9d2a-49f5-bab6-02748edcc7f4 - - - - -] [fb37673e-db74-4a7f-836d-86a8c7e9f3d8] AMQP server on messaging-0.internalapi.localdomain:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 32 seconds. Client port: None: error: [Errno 111] ECONNREFUSED
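
A quick way to confirm from one of the messaging nodes that no broker is actually answering (a diagnostic sketch; the container name follows the bundle naming visible in the pcs status output below):

# Is the rabbitmq bundle container running at all?
[root@messaging-0 ~]# docker ps --filter name=rabbitmq
# Does the broker inside the bundle respond?
[root@messaging-0 ~]# docker exec rabbitmq-bundle-docker-0 rabbitmqctl cluster_status
# Is anything listening on the AMQP port the clients keep retrying?
[root@messaging-0 ~]# ss -tlnp | grep 5672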


Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-7.0.0-0.20170805163048.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy a custom roles OSP11 deployment with the following setup:
3 x controller nodes
3 x database nodes
3 x messaging nodes
2 x networker nodes
3 x compute nodes
2. Upgrade to OSP12
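
For step 2, a minimal sketch of the upgrade command (the roles file and extra environment files are placeholders for whatever the deployment already uses; the essential piece for this bug is including major-upgrade-composable-steps-docker.yaml):

openstack overcloud deploy --templates \
  -r ~/custom_roles_data.yaml \
  -e <environment files used for the initial deployment> \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-composable-steps-docker.yaml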

Actual results:
major-upgrade-composable-steps-docker.yaml eventually times out as the upgrade leaves no running rabbitmq server in the cluster.

Expected results:
major-upgrade-composable-steps-docker.yaml completes successfully.

Additional info:

I noticed that the rabbitmq-clone pcs resource wasn't deleted during the upgrade:

[root@controller-0 heat-admin]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: messaging-0 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Wed Aug 16 08:34:10 2017
Last change: Tue Aug 15 23:03:27 2017 by root via cibadmin on controller-0

15 nodes configured
40 resources configured (9 DISABLED)

Online: [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
GuestOnline: [ galera-bundle-0@database-0 galera-bundle-1@database-1 galera-bundle-2@database-2 redis-bundle-0@controller-1 redis-bundle-1@controller-2 redis-bundle-2@controller-0 ]

Full list of resources:

 Clone Set: rabbitmq-clone [rabbitmq]
     Stopped (disabled): [ controller-0 controller-1 controller-2 database-0 database-1 database-2 messaging-0 messaging-1 messaging-2 ]
 ip-192.168.24.15	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.0.102	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.12	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-172.17.1.11	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.3.19	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.17	(ocf::heartbeat:IPaddr2):	Started controller-2
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started controller-0
 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:2017-08-15.1]
   rabbitmq-bundle-docker-0	(ocf::heartbeat:docker):	Started messaging-0
   rabbitmq-bundle-docker-1	(ocf::heartbeat:docker):	Started messaging-1
   rabbitmq-bundle-docker-2	(ocf::heartbeat:docker):	Started messaging-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp12/openstack-mariadb-docker:2017-08-15.1]
   galera-bundle-0	(ocf::heartbeat:galera):	Master database-0
   galera-bundle-1	(ocf::heartbeat:galera):	Master database-1
   galera-bundle-2	(ocf::heartbeat:galera):	Master database-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp12/openstack-redis-docker:2017-08-15.1]
   redis-bundle-0	(ocf::heartbeat:redis):	Slave controller-1
   redis-bundle-1	(ocf::heartbeat:redis):	Slave controller-2
   redis-bundle-2	(ocf::heartbeat:redis):	Slave controller-0
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp12/openstack-haproxy-docker:2017-08-15.1]
   haproxy-bundle-docker-0	(ocf::heartbeat:docker):	Started controller-1
   haproxy-bundle-docker-1	(ocf::heartbeat:docker):	Started controller-2
   haproxy-bundle-docker-2	(ocf::heartbeat:docker):	Started controller-0

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
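
As a possible manual workaround, the leftover resource can be deleted by hand (a sketch of what the upgrade itself was apparently supposed to do; untested here, and it may run into the same underlying failure):

# Deleting the rabbitmq primitive also removes the wrapping rabbitmq-clone clone set
[root@controller-0 heat-admin]# pcs resource delete rabbitmq
# Verify that only the rabbitmq-bundle resources remain
[root@controller-0 heat-admin]# pcs status | grep -i rabbitmq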

Comment 1 Damien Ciabrini 2017-08-16 09:17:06 UTC
Looking at the sequence of events that happened on node messaging-0 while the rabbitmq-clone resource was being deleted, I can see the following unexpected log messages:

Aug 15 22:48:10 messaging-0 crmd[16468]:   notice: Result of stop operation for rabbitmq on messaging-0: 0 (ok)
Aug 15 22:48:24 messaging-0 ansible-pacemaker_resource[175956]: Invoked with check_mode=False state=delete resource=rabbitmq timeout=300 wait_for_resource=True
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:    error: IDREF attribute rsc references an unknown ID "rabbitmq-clone"
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Updated CIB does not validate against pacemaker-2.8 schema/dtd
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Local-only Change (client:cibadmin, call: 2): 0.78.0 (Update does not conform to the configured schema)
Aug 15 22:48:25 messaging-0 cib[16463]:  warning: Completed cib_delete operation for section //clone/primitive[@id="rabbitmq"]/..: Update does not conform to the configured schema (rc=-203, 

So there might be several things at play here which ultimately made the resource deletion fail, and this apparently went unnoticed.
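
For context, the pacemaker_resource invocation in the log corresponds roughly to an upgrade task like the following (reconstructed from the logged parameters only; the task name and surrounding play are assumptions, not the actual tripleo-heat-templates source):

# Hypothetical reconstruction of the task behind the log line above
- name: Delete the rabbitmq cluster resource
  pacemaker_resource:
    resource: rabbitmq
    state: delete
    wait_for_resource: true
    timeout: 300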

Comment 2 Damien Ciabrini 2017-08-16 13:43:22 UTC
Created https://bugzilla.redhat.com/show_bug.cgi?id=1482116 to handle concurrent CIB updates with pcs.

Comment 3 Damien Ciabrini 2017-10-11 11:53:42 UTC
Putting this bug to MODIFIED because the problem shouldn't recur once https://bugzilla.redhat.com/show_bug.cgi?id=1482116 is verified.

Comment 4 Chris Jones 2017-10-24 15:16:23 UTC
Marius: Given Damien's comment #3 and the linked bug being VERIFIED, do you have any objections to moving this beyond MODIFIED?

Comment 5 Marius Cornea 2017-10-24 15:18:43 UTC
(In reply to Chris Jones from comment #4)
> Marius: Given Damien's comment #3 and the linked bug being VERIFIED, do you
> have any objections to moving this beyond MODIFIED?

No objections, we can also close this as a duplicate of bug 1482116.

Comment 6 Chris Jones 2017-10-25 09:43:41 UTC

*** This bug has been marked as a duplicate of bug 1482116 ***

