Bug 1661806 - Summary: After rolling controllers reboot - neutron_server/cinder_scheduler -> rabbitmq connections are not stable
Keywords:
Status: CLOSED DUPLICATE of bug 1640804
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Assignee: Peter Lemenkov
QA Contact: pkomarov
URL:
Whiteboard:
Depends On:
Blocks: 1592528
 
Reported: 2018-12-23 14:05 UTC by Raviv Bar-Tal
Modified: 2019-06-10 10:54 UTC
CC List: 23 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-10 10:54:09 UTC
Target Upstream Version:
Embargoed:


Attachments
nova logs (931.98 KB, text/plain)
2018-12-23 14:18 UTC, Raviv Bar-Tal
rabbitmq log (10.26 MB, text/plain)
2018-12-23 14:20 UTC, Raviv Bar-Tal
neutron log (9.12 MB, text/plain)
2018-12-23 14:25 UTC, Raviv Bar-Tal


Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 439266 0 None None None 2019-01-06 07:29:00 UTC

Description Raviv Bar-Tal 2018-12-23 14:05:22 UTC
Description of problem:
After rebooting the overcloud, nova fails to spawn a new VM with this error:
containers/nova/nova-compute.log:2018-12-23 13:24:07.381 1 ERROR nova.compute.manager [req-a204ffc7-5cc6-4efe-8bce-4691fb7bd6d7 dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] [instance: f29a6e8f-90bd-498a-9bbd-41bb8be2f598] Instance failed to spawn: PortBindingFailed: Binding failed for port 63cd0f42-c0c5-4bad-9e29-528314e98236, please check neutron logs for more information.

In the neutron log I can see it reconnecting to rabbitmq:

2018-12-23 13:24:08.832 26 INFO neutron.wsgi [req-507d735e-ed17-4ba3-be71-7f5a2d5ca4e3 dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] 172.17.1.36 "GET /v2.0/ports?device_id=f29a6e8f-90bd-498a-9bbd-41bb8be2f598 HTTP/1.1" status: 200  len: 186 time: 0.0640490
2018-12-23 13:24:09.796 30 INFO oslo.messaging._drivers.impl_rabbit [req-a9264b1f-51a2-4577-9e11-c25261b41c10 - - - - -] [7cbdbdfb-c6a6-4430-a5b7-3a26043170ed] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 52404.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install osp14 RC puddle
2. Reboot the overcloud (ir reboot)
3. Try to spawn a new VM (server create); an example command is sketched below
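
Example (flavor/image/network are placeholders; an actual invocation appears later in comment 23):
openstack server create --flavor <flavor> --image <image> --network <network> test-vm --wait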

Actual results:
Creation of a new VM fails.

Expected results:


Additional info:

Comment 1 Raviv Bar-Tal 2018-12-23 14:18:25 UTC
Created attachment 1516364 [details]
nova logs

Comment 2 Raviv Bar-Tal 2018-12-23 14:20:39 UTC
Created attachment 1516365 [details]
rabbitmq log

Comment 3 Raviv Bar-Tal 2018-12-23 14:25:23 UTC
Created attachment 1516366 [details]
neutron log

Comment 4 Raviv Bar-Tal 2018-12-23 14:27:07 UTC
More logs can be found at:

http://ikook.tlv.redhat.com/uploads/update/osp14/1661806/

Comment 5 Daniel Alvarez Sanchez 2018-12-24 11:24:38 UTC
I downloaded the logs and confirmed that the rabbitmq bundle is not able to start after the reboot:


Controller 0:

Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ DIAGNOSTICS ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ =========== ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-0'] ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ rabbit@controller-0: ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * connected to epmd (port 4369) on controller-0 ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [                   no other nodes on controller-0 ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * suggestion: start the node ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ current node details: ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-59@controller-0' ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Dec 23 12:13:20 [12] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:26720ms queue-time:0ms
Dec 23 12:13:30 [12] controller-0 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:31
Dec 23 12:13:31 [12] controller-0 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:31 pid:2267 exit-code:0 exec-time:761ms queue-time:0ms



Controller 1:

Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ DIAGNOSTICS ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ =========== ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ rabbit@controller-1: ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * connected to epmd (port 4369) on controller-1 ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [                   no other nodes on controller-1 ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * suggestion: start the node ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ current node details: ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-85@controller-1' ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Dec 23 12:06:52 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:23795ms queue-time:0ms
Dec 23 12:07:03 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:32
Dec 23 12:07:04 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:32 pid:2120 exit-code:0 exec-time:728ms queue-time:0ms
Dec 23 12:12:16 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:286
Dec 23 12:12:17 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:286 pid:11269 exit-code:0 exec-time:699ms queue-time:0ms
Dec 23 12:12:33 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:300
Dec 23 12:12:34 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:300 pid:11750 exit-code:0 exec-time:732ms queue-time:0ms
Dec 23 12:12:38 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:301
Dec 23 12:12:45 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:301 pid:12236 exit-code:0 exec-time:7251ms queue-time:0ms
Dec 23 12:12:49 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:316
Dec 23 12:12:49 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:316 pid:12244 exit-code:0 exec-time:715ms queue-time:0ms
Dec 23 12:13:31 [12] controller-1 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:346
Dec 23 12:13:35 [12] controller-1 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:346 pid:13696 exit-code:0 exec-time:3485ms queue-time:0ms
Dec 23 12:17:43 [12] controller-1 pacemaker_remoted:  warning: qb_ipcs_event_sendv:     new_event_notification (12-20924-17): Broken pipe (32)
Dec 23 12:17:43 [12] controller-1 pacemaker_remoted:  warning: send_client_notify:      Could not notify client proxy-cib_rw-20924-63f76b1f/63f76b1f-d917-4835-9664-56e4de96241



Controller 2:
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-2'] ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ rabbit@controller-2: ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * connected to epmd (port 4369) on controller-2 ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [                   no other nodes on controller-2 ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * suggestion: start the node ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ current node details: ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-46@controller-2' ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: unable to connect to node 'rabbit@controller-2': nodedown ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ DIAGNOSTICS ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ =========== ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-2'] ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ rabbit@controller-2: ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * connected to epmd (port 4369) on controller-2 ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * epmd reports: node 'rabbit' not running at all ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [                   no other nodes on controller-2 ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [   * suggestion: start the node ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ current node details: ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-48@controller-2' ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [  ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:   notice: operation_finished:      rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ]
Dec 23 12:02:32 [12] controller-2 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:24863ms queue-time:0ms
Dec 23 12:02:44 [12] controller-2 pacemaker_remoted:     info: log_execute:     executing - rsc:rabbitmq action:notify call_id:32
Dec 23 12:02:45 [12] controller-2 pacemaker_remoted:     info: log_finished:    finished - rsc:rabbitmq action:notify call_id:32 pid:2119 exit-code:0 exec-time:728ms queue-time:0ms
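
(A quick way to confirm the broker/bundle state on a controller — sketch only; the container name is assumed to follow the rabbitmq-bundle-docker-<N> pattern used by this deployment:)

[root@controller-0 ~]# pcs status | grep -A 4 rabbitmq-bundle                                        # bundle replicas should be Started on all three controllers
[root@controller-0 ~]# docker exec -uroot -it rabbitmq-bundle-docker-0 rabbitmqctl cluster_status    # all rabbit@controller-* nodes should be listed as running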



Not sure if this may be a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1656368

Comment 6 pkomarov 2018-12-24 11:38:15 UTC
I see:
1) on controller-2:
2018-12-23 12:47:33.306 1 INFO nova.compute.manager [req-577013c5-4241-4fb6-9204-0e1673529906 - - - - -] [instance: 494c82a5-9284-44c4-b1dc-1baa39791e96] During sync_power_state the instance has a pending task (spawning). Skip.
2018-12-23 12:47:46.244 1 ERROR oslo.messaging._drivers.impl_rabbit [req-c1b0f7cb-d8fa-45e1-9684-a36f224542fe - - - - -] [2cd23e8d-610b-483a-bc08-ce00c4f49bd8] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out

2) on compute-0 : 
/var/log/containers/nova/nova-compute.log.1:2018-12-23 12:49:30.056 1 ERROR nova.compute.manager [req-583fa0df-11f0-48e0-8a19-57d6cac7b88f dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] [instance: b0de420d-a108-4ecb-82b3-d01407f3423b] Instance failed to spawn: VirtualInterfaceCreateException: Virtual Interface creation failed 

So from 1) I see the instance failing to recover from a reboot, and then in 2) I see that roughly two minutes later the virtual interface for that instance can't be created.

Maybe this is a race condition where the compute tries to start up a VM after a reboot but the network infrastructure is not fully up yet...

Comment 7 pkomarov 2018-12-24 11:39:21 UTC
Note that after some time rabbitmq does manage to reconnect:

2018-12-23 14:08:27.945 1 INFO oslo.messaging._drivers.impl_rabbit [req-c1b0f7cb-d8fa-45e1-9684-a36f224542fe - - - - -] [2cd23e8d-610b-483a-bc08-ce00c4f49bd8] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 47646.
2018-12-23 14:08:52.451 1 ERROR oslo.messaging._drivers.impl_rabbit [req-dbaaf87b-1878-4e65-b6f1-1fa9a4ec723a 4c9abe3ffad741108895b6b01035b0b4 2f06dfd71b464c10a004b3c6eb3137d9 - default default] [872ee446-5c90-4cec-aa10-7a7c01bbc7ce] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
2018-12-23 14:08:53.466 1 INFO oslo.messaging._drivers.impl_rabbit [req-dbaaf87b-1878-4e65-b6f1-1fa9a4ec723a 4c9abe3ffad741108895b6b01035b0b4 2f06dfd71b464c10a004b3c6eb3137d9 - default default] [872ee446-5c90-4cec-aa10-7a7c01bbc7ce] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 37638.

Comment 8 Daniel Alvarez Sanchez 2018-12-24 11:50:37 UTC
Right on, I saw that too. Still, 2 hours after the reboot (not sure if the times in the logs are in sync, though) looks like way too much.
@Roee or anybody, can you check if you can boot a server now? If it's just a matter of a few minutes for the cluster to be brought up, I'd perhaps remove the blocker flag and document it.

If it's a different issue, it'd be nice to confirm whether it has to do with the resource-agents version. I'm trying on my OSP14 setup, which has the 'problematic' version inside the bundle container:

[root@controller-2 ~]# rpm -qa|grep resource-agents
resource-agents-4.1.1-12.el7_6.6.x86_64
[root@controller-2 ~]# docker ps|grep rabbit
754b65a943ec        192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest                    "/bin/bash /usr/lo..."   6 days ago          Up 6 days                                 rabbitmq-bundle-docker-2

[root@controller-2 ~]# docker exec -uroot  -it 754b65a943ec bash
()[root@controller-2 /]# rpm -qa|grep resource-agent
resource-agents-4.1.1-12.el7_6.4.x86_64

After restarting all controllers I can see that the rabbit cluster is healthy:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2


[root@controller-2 neutron]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[root@controller-2 neutron]# uname -a
Linux controller-2 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I also saw that there's a new version of resource-agents available:
Resolving Dependencies
--> Running transaction check
---> Package resource-agents.x86_64 0:4.1.1-12.el7_6.6 will be updated
---> Package resource-agents.x86_64 0:4.1.1-12.el7_6.7 will be an update
--> Finished Dependency Resolution

Comment 9 pkomarov 2018-12-24 12:29:39 UTC
Raviv, can you elaborate on the Steps to Reproduce?

2. Reboot the overcloud (ir reboot) -> does this mean "ir tripleo-overcloud --postreboot: true"? If not, how did you reboot the overcloud?

3. Try to spawn new VM (server create) -> what is the time frame here? How much time passed after the reboots before you tried to build an instance?

thanks.

Comment 10 Raviv Bar-Tal 2018-12-25 08:45:41 UTC
Hi Pini,
1. This is the full command that I use for the reboot:
   infrared tripleo-overcloud -o overcloud-reboot.yml --postreboot True --postreboot_evacuate yes --deployment-files composable_roles --overcloud-stack overcloud  -e @cleanup_services.ymlt

2. I waited 10-15 minutes before trying to spawn a new server.

Comment 11 pkomarov 2018-12-25 13:38:57 UTC
OK, so infrared tripleo-overcloud --postreboot True:
 https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml

performs a lot of additional actions besides the reboot of the overcloud; basically:

for each node in overcloud_nodes:

- checks if there are active instances to be migrated
- checks the overall overcloud stable state
- sets ceph to rebalance=off
- shuts down the pacemaker cluster on each controller with --force and --timeout=5m
- reboots all of the overcloud nodes (3 controllers, 1 compute, 3 ceph)
- waits until the node is back up
- waits until pacemaker is up (in case it's a controller)
- does ceph health checks
- migrates any previous instances

So my problem with this scenario (pacemaker-wise) is with:
"shuts down pacemaker cluster on each controller with --force and --timeout=5m"
Is --force necessary? And why not do "pcs cluster stop --all" on one controller, so that the pacemaker shutdown is nice and synchronised?
This is supposed to be a soft-reboot testing scenario (all services checked and softly handled before and after), according to the official documentation:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14-beta/html/director_installation_and_usage/sect-rebooting_the_overcloud#rebooting_controller_nodes

So, since "infrared tripleo-overcloud --postreboot True" does the pacemaker shutdown not according to the documentation,
I would repeat this test with two scenario variations (a sketch of the manual commands follows below):
1) reboot only the controller and compute nodes
2) reboot the whole overcloud
Both manually, with "pcs cluster stop --all", on the condition that the command executes successfully.
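
A rough sketch of what I mean by the manual variant (run from any one controller; usernames and the exact wait logic are assumptions):

[heat-admin@controller-0 ~]$ sudo pcs cluster stop --all      # synchronised pacemaker shutdown across all controllers
# reboot the nodes, and once they are back up:
[heat-admin@controller-0 ~]$ sudo pcs cluster start --all
[heat-admin@controller-0 ~]$ sudo pcs status                  # wait until all bundles report Started/Master again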

I'll try to reproduce it myself; feel free to as well.

Comment 12 Eran Kuris 2018-12-27 07:45:29 UTC
BTW, it reproduces on an OVN deployment too.

Comment 13 pkomarov 2018-12-27 10:25:55 UTC
The issue reproduces;
there is no need for the full restart scenario.

Reproduce actions (commands sketched below):
1) choose one controller and do pcs cluster stop,
2) wait,
3) then reboot,
4) wait, and see that the resources in Stopped mode are:
rabbitmq-bundle-1, galera-bundle-1, redis-bundle-1, haproxy-bundle-docker-1
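
A sketch of the reproducer commands (hostnames/usernames are an example):

[heat-admin@controller-1 ~]$ sudo pcs cluster stop            # stop pacemaker/corosync on this node only
# wait for the stop to finish, then:
[heat-admin@controller-1 ~]$ sudo reboot
# after the node is back, check from another controller:
[heat-admin@controller-0 ~]$ sudo pcs status | grep -i stopped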

So this is a bundle issue,

and not a duplicate of:
https://bugzilla.redhat.com/show_bug.cgi?id=1656368
since that bug's fixes are already present here:
http://pastebin.test.redhat.com/688426
[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -qa|grep resource-agents'

controller-1 | SUCCESS | rc=0 >>
resource-agents-4.1.1-12.el7_6.7.x86_64

Comment 14 pkomarov 2018-12-27 14:06:59 UTC
#checking using resource debug-start:

#pcs resource debug-start rabbitmq --full => 
http://pastebin.test.redhat.com/688460

From the output and resource-agent/rabbitmq-cluster flow:

resource starts -> rmq_try_start() ->  rmq_monitor() ...

 >  stderr: + 13:51:39: rmq_monitor:200: /usr/sbin/rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().' 

 >  stderr: /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 200: /usr/sbin/rabbitmqctl: No such file or directory 

 >  stderr: + 13:51:39: __ha_log:248: echo 'rabbitmq-cluster(rabbitmq)[434749]:	Dec' 27 13:51:39 'INFO: RabbitMQ server could not get cluster status from mnesia' 

and all goes downhill from there..
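
(To confirm this on the controller host — sketch only, not from the original run:)

[root@controller-0 ~]# ls -l /usr/sbin/rabbitmqctl        # the path the resource agent calls; missing here per the debug output
[root@controller-0 ~]# yum provides '*/sbin/rabbitmqctl'  # shows which package would supply it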

Comment 15 pkomarov 2018-12-27 15:59:25 UTC
#guessing this is what's missing (it provides rabbitmqctl):
yum install -y rabbitmq-server-3.6.16-1.el7ost.noarch
(the previous error goes away)

but still, the node's bundles don't survive a reboot;
they do, however, survive pcs cluster stop && pcs cluster start.

Maybe recent fixes around this have something to do with it (heartbeat/rabbitmq-cluster):
rmq_delete_nodename()
...
        ${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -D
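        # ("-l reboot" targets a transient node attribute that is cleared when the node reboots; "-D" deletes it)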

Comment 16 pkomarov 2018-12-28 14:51:17 UTC
Update:

Doing "pcs resource refresh --force" or "pcs resource cleanup" seems to clear up the error count, so that the bundles start one by one.

Before that, the following was logged in corosync.log:
#after reboot : 
Dec 28 03:59:09 [594832] controller-0    pengine:     info: determine_online_status:    Node controller-1 is online
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: unpack_rsc_op_failure:      Processing failed start of rabbitmq-bundle-docker-1 on controller-1: unknown error | rc=1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: unpack_rsc_op_failure:      Processing failed start of galera-bundle-docker-1 on controller-1: unknown error | rc=1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: unpack_rsc_op_failure:      Processing failed start of redis-bundle-docker-1 on controller-1: unknown error | rc=1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: unpack_rsc_op_failure:      Processing failed start of haproxy-bundle-docker-1 on controller-1: unknown error | rc=1
Dec 28 03:59:09 [594832] controller-0    pengine:     info: pe_get_failcount:   Container rabbitmq-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: check_migration_threshold:  Forcing rabbitmq-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000)
Dec 28 03:59:09 [594832] controller-0    pengine:     info: pe_get_failcount:   Container galera-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: check_migration_threshold:  Forcing galera-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000)
Dec 28 03:59:09 [594832] controller-0    pengine:     info: pe_get_failcount:   Container redis-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: check_migration_threshold:  Forcing redis-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000)
Dec 28 03:59:09 [594832] controller-0    pengine:     info: pe_get_failcount:   haproxy-bundle-docker-1 has failed INFINITY times on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:  warning: check_migration_threshold:  Forcing haproxy-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000)
Dec 28 03:59:09 [594832] controller-0    pengine:     info: RecurringOp:         Start recurring monitor (60s) for rabbitmq-bundle-1 on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:     info: RecurringOp:         Start recurring monitor (60s) for galera-bundle-1 on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:     info: RecurringOp:         Start recurring monitor (60s) for redis-bundle-1 on controller-1
Dec 28 03:59:09 [594832] controller-0    pengine:   notice: LogAction:   * Start      rabbitmq-bundle-1                    (                      controller-1 )   due to unrunnable rabbitmq-bundle-docker-1 start (blocked)
Dec 28 03:59:09 [594832] controller-0    pengine:   notice: LogAction:   * Start      galera-bundle-1                      (                      controller-1 )   due to unrunnable galera-bundle-docker-1 start (blocked)
Dec 28 03:59:09 [594832] controller-0    pengine:   notice: LogAction:   * Start      redis-bundle-1                       (                      controller-1 )   due to unrunnable redis-bundle-docker-1 start (blocked)

Comment 17 pkomarov 2018-12-29 08:05:41 UTC
Update: after a reboot, all of a specific node's bundles have a fail count, which then leads pacemaker to force the bundle resources away from their controller.

Workaround: pcs resource refresh --force --node controller-1

This way the bundled resources' failures are rechecked and the bundles come back online on controller-1.
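
For reference, a sketch of checking the fail counts before clearing them (node/resource names as above):

[root@controller-0 ~]# crm_mon -1 --failcounts                             # per-node resource fail counts; INFINITY for controller-1 here
[root@controller-0 ~]# pcs resource refresh --force --node controller-1   # reprobe the resources and clear the failures on that node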

Two issues here:
- why is the resource refresh behaviour not handled on a controller reboot, so that a correct reprobe of the resources is done at boot?
- the resource max failure count is reached, but it is then not handled by any crm reprobe action?

Comment 19 pkomarov 2019-01-02 13:32:49 UTC
Not reproducible in the latest passed_phase2 puddle; fixed in core_puddle_version: 2018-12-20.4
#testing restart of pacemaker and bundles: 
http://pastebin.test.redhat.com/689808
#testing of booting a nova instance after reboot of a controller:
http://pastebin.test.redhat.com/689810

Comment 20 Daniel Alvarez Sanchez 2019-01-02 14:18:16 UTC
Thanks Pini for taking care of this! Do you happen to know what the actual fix has been?

Comment 21 pkomarov 2019-01-02 15:18:36 UTC
I can't pinpoint the exact fix,
and since the reproducer is manual, it is difficult to establish a valid testing baseline here.

I've been monitoring pacemaker's bundles (rabbit, etc.) and the corosync logs across reboots and multiple scenarios.
From what I could see, pacemaker did not recheck the bundles' state and considered them still not eligible to run;
in later testing this was followed by pacemaker's monitoring operations kicking in and restarting the bundles.

Comment 22 Raviv Bar-Tal 2019-01-03 07:38:13 UTC
The bug was just reproduced on titan38;
you are welcome to check it.

Comment 23 pkomarov 2019-01-03 09:49:03 UTC
#from rabbitmq logs:
=WARNING REPORT==== 3-Jan-2019::09:27:20 ===
closing AMQP connection <0.25933.15> (172.17.1.11:35978 -> 172.17.1.39:5672 - neutron-server:35:64ccbfd3-f02f-4661-b80e-a910227423a1, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

#from neutron server logs: <- a lot of reconnects
2019-01-03 09:30:54.000 31 ERROR oslo.messaging._drivers.impl_rabbit [req-d3576839-62fa-46dc-b261-a471437fb914 - - - - -] [dd6e5a63-fb1a-47d1-b613-471e6db192cd] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out

#rabbit logs show a lot of these: <- they may be related to a reboot
=ERROR REPORT==== 3-Jan-2019::09:00:40 ===
Discarding message {'$gen_cast',{deliver,{delivery,false,true,<0.2096.15>,{basic_message,{resource,<<"/">>,exchange,<<"q-server-resource-versions_fanout">>},[<<>>],{content,60,{'P_basic',<<"application/json">>,<<"utf-8">>,[],2,0,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<248,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110,5,117,116,102,45,56,0,0,0,0,2,0>>,rabbit_framing_amqp_0_9_1,[<<"{\"oslo.message\": \"{\\"_context_domain\\": null, \\"_context_request_id\\": \\"req-d3576839-62fa-46dc-b261-a471437fb914\\", \\"_context_global_request_id\\": null, \\"_context_auth_token\\": null, \\"_context_resource_uuid\\": null, \\"_context_tenant_name\\": null, \\"_context_user\\": null, \\"_context_user_id\\": null, \\"_context_system_scope\\": null, \\"_context_show_deleted\\": false, \\"_context_is_admin\\": true, \\"version\\": \\"1.0\\", \\"_context_project_domain\\": null, \\"_context_timestamp\\": \\"2019-01-02 22:36:56.666997\\", \\"method\\": \\"report_agent_resource_versions\\", \\"_context_project\\": null, \\"_context_roles\\": [], \\"args\\": {\\"version_map\\": {\\"Subnet\\": \\"1.0\\", \\"Network\\": \\"1.0\\", \\"PortForwarding\\": \\"1.0\\", \\"SubPort\\": \\"1.0\\", \\"SecurityGroup\\": \\"1.0\\", \\"SecurityGroupRule\\": \\"1.0\\", \\"Trunk\\": \\"1.1\\", \\"QosPolicy\\": \\"1.7\\", \\"Port\\": \\"1.4\\", \\"Log\\": \\"1.0\\"}, \\"agent_type\\": \\"Open vSwitch agent\\", \\"agent_host\\": \\"controller-1.localdomain\\"}, \\"_unique_id\\": \\"ac1c4525959f4a53b636b01611e4dc0a\\", \\"_context_tenant_id\\": null, \\"_context_is_admin_project\\": true, \\"_context_project_name\\": null, \\"_context_user_identity\\": \\"- - - - -\\", \\"_context_tenant\\": null, \\"_context_project_id\\": null, \\"_context_read_only\\": false, \\"_context_user_domain\\": null, \\"_context_user_name\\": null}\", \"oslo.version\": \"2.0\"}">>]},<<55,49,240,202,110,182,239,112,255,54,215,95,92,238,64,243>>,true},1,flow},false}} from <0.2096.15> to <0.3614.0> in an old incarnation (1) of this node (2)

#after I did an overcloud pacemaker cluster restart (pcs cluster stop --all && pcs cluster start --all)
things now seem back to normal

#now server creation is successful:
(overcloud) [stack@undercloud-0 ~]$ openstack server create --flavor m1.micro --image cirros-0.3.5-x86_64-disk.img --network 5104edf3-b3f3-4adc-ad2a-124336adc3fa osvm2 --wait

+-------------------------------------+---------------------------------------------------------------------+
| Field                               | Value                                                               |
+-------------------------------------+---------------------------------------------------------------------+
| OS-DCF:diskConfig                   | MANUAL                                                              |
| OS-EXT-AZ:availability_zone         | nova                                                                |
| OS-EXT-SRV-ATTR:host                | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute-1.localdomain                                               |
| OS-EXT-SRV-ATTR:instance_name       | instance-00000301                                                   |
| OS-EXT-STS:power_state              | Running                                                             |
| OS-EXT-STS:task_state               | None                                                                |
| OS-EXT-STS:vm_state                 | active                                                              |
| OS-SRV-USG:launched_at              | 2019-01-03T09:43:33.000000                                          |
| OS-SRV-USG:terminated_at            | None                                                                |
| accessIPv4                          |                                                                     |
| accessIPv6                          |                                                                     |
| addresses                           | internal_net_b326fd44c2=192.168.0.14                                |
| adminPass                           | 98YDt4LTeD5y                                                        |
| config_drive                        |                                                                     |
| created                             | 2019-01-03T09:40:18Z                                                |
| flavor                              | m1.micro (a2c6130a-5f98-4038-8914-5e75b8b873ea)                     |
| hostId                              | 9af54f0512fbc853856e62f4eaabbc049bfb186a33f2139f1f5cfcde            |
| id                                  | 9cad3922-7886-48e0-87b1-fc73000ae02b                                |
| image                               | cirros-0.3.5-x86_64-disk.img (60c3bd60-8309-421e-8a16-e7f9af7219e3) |
| key_name                            | None                                                                |
| name                                | osvm2                                                               |
| progress                            | 0                                                                   |
| project_id                          | 851bf1c1256b438191e0516ed551813d                                    |
| properties                          |                                                                     |
| security_groups                     | name='default'                                                      |
| status                              | ACTIVE                                                              |
| updated                             | 2019-01-03T09:43:33Z                                                |
| user_id                             | 8940592d97c44117a685ab75d702855d                                    |
| volumes_attached                    |                                                                     |
+-------------------------------------+---------------------------------------------------------------------+

Comment 24 pkomarov 2019-01-03 15:19:30 UTC
After manually reproducing the overcloud reboot steps in https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml
I can say that, done manually, I do not see the errors that are present after that playbook is run.
I pcs-cluster-stopped each controller in turn and made sure it came back up with pacemaker healthy,
then nova live-migrated each compute's instances and rebooted it.
All actions were successful; all pacemaker resources and all instances are Active and healthy after the manual procedure.
In addition, the step where the ansible playbook failed passed for me (in the action logs).

manual actions and results log: 
http://pastebin.test.redhat.com/690459
sos reports during the procedure:
http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1661806/

My understanding is that this is an automation error (blaming the overcloud_reboot ansible playbook) rather than a bug, but to be sure,
Raviv, can you run the ir reboot procedure a couple of times and report the results?

Comment 25 Raviv Bar-Tal 2019-01-07 09:32:43 UTC
The IR reboot command failed again in the same place and with the same error.
That being said, the system is not functional and all services are constantly reconnecting to the AMQP server.

=WARNING REPORT==== 7-Jan-2019::09:01:23 ===
closing AMQP connection <0.25239.5059> (172.17.1.19:42468 -> 172.17.1.11:5672 - neutron-server:33:57e91aed-bbb0-4ee8-9760-66a2e1f5d94f, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=WARNING REPORT==== 7-Jan-2019::09:01:23 ===
closing AMQP connection <0.4118.4994> (172.17.1.19:42466 -> 172.17.1.11:5672 - neutron-server:33:e0c86681-7785-4b37-a496-4183a1f16a16, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=WARNING REPORT==== 7-Jan-2019::09:01:23 ===
closing AMQP connection <0.10045.5532> (172.17.1.19:42476 -> 172.17.1.11:5672 - neutron-server:32:f428ac86-a9b0-492c-a606-435642ff9a79, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=WARNING REPORT==== 7-Jan-2019::09:01:23 ===
closing AMQP connection <0.1599.4801> (172.17.1.19:42486 -> 172.17.1.11:5672 - neutron-server:34:ecb32873-5d9f-4574-ac0b-998e1e31afdb, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=INFO REPORT==== 7-Jan-2019::09:01:24 ===
accepting AMQP connection <0.9040.4835> (172.17.1.19:55848 -> 172.17.1.11:5672)


pcs status shows no errors:
[root@controller-2 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum
Last updated: Mon Jan  7 09:28:29 2019
Last change: Thu Jan  3 15:47:05 2019 by redis-bundle-1 via crm_attribute on controller-1

12 nodes configured
38 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-0
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-2
 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Slave controller-0
   redis-bundle-1       (ocf::heartbeat:redis): Master controller-1
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.14       (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-10.0.0.101  (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.13 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.1.25 (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-172.17.3.11 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.4.16 (ocf::heartbeat:IPaddr2):       Started controller-2
 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest]
   haproxy-bundle-docker-0      (ocf::heartbeat:docker):        Started controller-0
   haproxy-bundle-docker-1      (ocf::heartbeat:docker):        Started controller-1
   haproxy-bundle-docker-2      (ocf::heartbeat:docker):        Started controller-2
 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-docker-0     (ocf::heartbeat:docker):        Started controller-1
 Docker container: openstack-cinder-backup [192.168.24.1:8787/rhosp14/openstack-cinder-backup:pcmklatest]
   openstack-cinder-backup-docker-0     (ocf::heartbeat:docker):        Started controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Comment 26 pkomarov 2019-01-07 11:26:24 UTC
Looking at that env, on the neutron and rabbitmq side:

(undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'grep ERROR /var/log/containers/neutron/*|tail'
 [WARNING]: Found both group and host with same name: undercloud

compute-1 | SUCCESS | rc=0 >>
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     call_monitor_timeout, retry=retry)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     call_monitor_timeout)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = self.waiters.get(msg_id, timeout=timeout)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     'to message ID %s' % msg_id)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID dc939b8e46794bdfa8300a90da45cffa
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 

compute-0 | SUCCESS | rc=0 >>
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(f7e5c65cb22b4238a2be9016fa2f9e5c). Skipping it.
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 361, in _callback
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 251, in __call__
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit     unique_id = self.msg_id_cache.check_duplicate_message(message)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 122, in check_duplicate_message
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit     raise rpc_common.DuplicateMessageError(msg_id=msg_id)
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(f7e5c65cb22b4238a2be9016fa2f9e5c). Skipping it.
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit 

controller-0 | SUCCESS | rc=0 >>
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.041 33 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [2011ceb7-70d9-4b47-9ca0-293e7f203623] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.066 31 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [fffac77f-a0e8-4b80-bcd2-ffd3f76c80ea] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:17.716 34 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [9add6b67-992c-4fc1-b332-9392e1397f1b] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:17.731 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [03c8ae06-0fae-4a57-928a-cc9819f81f5e] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:18.164 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [d4298671-2063-46fe-a0c7-e64592d67878] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:19.955 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [41168b13-0683-4986-8d69-c02fc11c6d38] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.120 35 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [3c9b58b2-0daf-48c7-82c5-c26490679d30] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.421 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [41950f6c-0346-42a3-bbe5-17599ee47c27] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.432 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [98042e42-8390-4a0e-808e-3979084f0aa6] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.778 35 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [b1be6646-09e0-4932-a664-3f27957f12ed] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out

controller-1 | SUCCESS | rc=0 >>
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:23.669 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [e8f33ad0-b4d8-4371-af3a-e7c140f387ba] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:24.281 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [beb0bdf4-dad1-46df-91b1-93f0a8308e8f] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.254 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [06f8beac-c9f7-4ca9-9d7f-d24ce59d17ac] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.718 32 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [77e6b69d-9036-4611-b6ef-88d2477b977d] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.770 32 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [1ae40b83-e99a-417f-9839-b47051088172] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:26.190 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [b87d6ddd-7d88-4b20-8ef0-2f6487a3c667] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.007 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [d8854703-a6f6-46b9-88f9-98fb364082d9] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.173 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [16dd3d75-464f-4df0-acaf-7fc7f6631a9b] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.382 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [0ae351cf-2e12-4504-b1b7-ae6a49799f2e] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:00:28.744 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [e0c86681-7785-4b37-a496-4183a1f16a16] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out

controller-2 | SUCCESS | rc=0 >>
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:11.822 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [097f69c1-cd44-4fa4-a475-5e364738aa7c] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:11.926 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [95687baa-4423-46b3-8469-26a40ca1f6a3] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:12.200 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [30c3c84e-4725-453e-9f2d-6112f2544a66] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:12.774 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [a2dd661d-bd30-4a0c-9c3d-88af83f89fe2] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:14.945 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [7896a37d-a872-4db3-bbea-ef256351c868] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.193 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [99c8b542-c0e7-4b2e-b8c7-3c88ad7cc710] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.326 30 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [7adfc2f1-ea84-441e-8788-8212b858c95e] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.826 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [e1261124-0fa6-43db-be24-c699ff07251e] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.430 33 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [1e4ff701-48dc-4c90-80a0-4b06dd7d42e6] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
/var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.481 34 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [f1cb0285-5edc-4297-912e-9cebc15acf30] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed

ceph-0 | SUCCESS | rc=0 >>
grep: /var/log/containers/neutron/*: No such file or directory

ceph-1 | SUCCESS | rc=0 >>
grep: /var/log/containers/neutron/*: No such file or directory

ceph-2 | SUCCESS | rc=0 >>
grep: /var/log/containers/neutron/*: No such file or directory

(undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"grep 'closing\|unexpectedly' /var/log/containers/rabbitmq/*|tail"
 [WARNING]: Found both group and host with same name: undercloud

compute-1 | SUCCESS | rc=0 >>
grep: /var/log/containers/rabbitmq/*: No such file or directory

compute-0 | SUCCESS | rc=0 >>
grep: /var/log/containers/rabbitmq/*: No such file or directory

controller-0 | SUCCESS | rc=0 >>
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8062.161> (172.17.1.39:40134 -> 172.17.1.39:5672 - neutron-server:33:d4298671-2063-46fe-a0c7-e64592d67878, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8872.161> (172.17.1.11:35398 -> 172.17.1.39:5672 - neutron-server:30:2d250e51-b3d3-4394-9dda-869b5d159a0b, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.9085.161> (172.17.1.11:35420 -> 172.17.1.39:5672 - neutron-server:30:ec96e54a-697c-4b78-a07a-1fce56891653, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.6866.161> (172.17.1.39:40250 -> 172.17.1.39:5672 - neutron-server:35:3c9b58b2-0daf-48c7-82c5-c26490679d30, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8122.161> (172.17.1.39:40306 -> 172.17.1.39:5672 - neutron-server:35:b1be6646-09e0-4932-a664-3f27957f12ed, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection

controller-1 | SUCCESS | rc=0 >>
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.22472.733> (172.17.1.19:42740 -> 172.17.1.19:5672 - neutron-server:33:81d8ed32-3cae-41db-85ec-741fbddf2af9, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.2203.734> (172.17.1.19:43152 -> 172.17.1.19:5672 - neutron-server:32:303ef188-6d24-4d87-ab72-47967e91fd4b, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23214.733> (172.17.1.39:33868 -> 172.17.1.19:5672 - neutron-server:35:ed1a7191-405b-4913-9a80-dffb16241a1e, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.11358.733> (172.17.1.19:43200 -> 172.17.1.19:5672 - neutron-server:31:c30cfaed-5263-464c-88d2-ac7100469ece, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23112.733> (172.17.1.19:43202 -> 172.17.1.19:5672 - neutron-server:31:e8071e69-9ada-4066-b3e3-fd2f93693e23, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection

controller-2 | SUCCESS | rc=0 >>
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.25115.4939> (172.17.1.19:48384 -> 172.17.1.11:5672 - neutron-server:32:b87d6ddd-7d88-4b20-8ef0-2f6487a3c667, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.26332.4617> (172.17.1.19:48752 -> 172.17.1.11:5672 - neutron-server:34:14e9a285-1a7d-4593-9c8f-a80121864e95, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.10281.5012> (172.17.1.19:42358 -> 172.17.1.11:5672 - neutron-server:32:03a5ad79-d8c3-4d51-bcd1-4afdac27c38a, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.16716.4442> (172.17.1.39:33098 -> 172.17.1.11:5672 - neutron-server:34:c7b9189d-4d61-49f2-821f-e989c8518dab, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23174.5374> (172.17.1.11:55596 -> 172.17.1.11:5672 - neutron-server:32:c5fd4df7-7b6b-4e26-9e24-4047944e7e3e, vhost: '/', user: 'guest'):
/var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection

ceph-0 | SUCCESS | rc=0 >>
grep: /var/log/containers/rabbitmq/*: No such file or directory

ceph-1 | SUCCESS | rc=0 >>
grep: /var/log/containers/rabbitmq/*: No such file or directory

ceph-2 | SUCCESS | rc=0 >>
grep: /var/log/containers/rabbitmq/*: No such file or directory

Comment 27 pkomarov 2019-01-07 14:13:41 UTC
I was able to reproduce the errors here using the same playbook.

After applying the patch linked below, the cluster came back healthy after running the restart playbook.
Patch: https://review.gerrithub.io/439266

Results and cluster health checks after applying the patch:
http://pastebin.test.redhat.com/691818

Raviv, can you:
- return the pacemaker cluster to full health with: pcs cluster stop --all && pcs cluster start --all (run on one controller; see the sketch after this list)
- test the ir-reboot playbook with the suggested patch? (gerrit_change - 439266)
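
A minimal sketch of that recovery-and-verification sequence, assuming a standard pacemaker-managed rabbitmq bundle (the container name in the last command is an assumption and may differ per deployment):

  # run on one controller; stops and restarts pacemaker on all nodes
  sudo pcs cluster stop --all
  sudo pcs cluster start --all

  # confirm the rabbitmq bundle and the other resources report Started
  sudo pcs status

  # check that all rabbit nodes rejoined the cluster (container name is an assumption)
  sudo docker exec rabbitmq-bundle-docker-0 rabbitmqctl cluster_status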

Thanks

Comment 29 Raviv Bar-Tal 2019-01-15 11:53:08 UTC
Jenkins jobs are still failing in the same way;
we need to find a better solution for this problem.
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-14-from-RC-HA-ipv4/9/console

Comment 30 pkomarov 2019-01-21 07:23:15 UTC
Update: I'm seeing another issue here, and this one happens before the overcloud reboot:

This is from the OVN jobs: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

There is an OC instance created during the overcloud update: this instance has a Cinder volume which is in ERROR state, and that prevents the live migration:

Instance creation and volume attachment happen here (before the OC reboot): https://github.com/openstack/tripleo-upgrade/blob/c5038babb5966d3998e807b668c9ecc477175cb0/templates/workload_launch.sh.j2#L211

The attachment error can be seen here:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:21.075 1 ERROR nova.volume.cinder [req-9cc1f5b0-023d-4eca-b071-3d102d2776d4 ec75d0e6c48c4e8e9ee9818e0b88bdf5 8fa2e29ca76f4106955124dd1dea5331 - default default] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Create attachment failed for volume 68bb0540-d8f1-4cdf-962b-7f66e2523821. Error: Unable to create attachment for volume

and also here, which causes the live migration to halt:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [-] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Pre live migration failed at compute-0.localdomain: RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)
...
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)

Cinder also confirms the error with the volume: 
/var/log/containers/cinder/cinder-volume.log:2019-01-17 10:23:42.514 70 ERROR cinder.volume.manager Stderr: u'  Failed to find logical volume "cinder-volumes/volume-68bb0540-d8f1-4cdf-962b-7f66e2523821"\n'

To test this with more jobs I've created an infrared commit which tries to check for and remove the volume from the workload instance before the OC_reboot procedure:
https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/441197/3
You can reproduce this using: IR_GERRIT_CHANGE: 441197/3

This patch tries to run: openstack server remove volume $server_workload_id $vol_id (a hedged sketch of this pre-reboot detach flow follows below)
Results: after the overcloud update the instance volume is in Error state:
http://pastebin.test.redhat.com/699078
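
For illustration, a minimal sketch of that pre-reboot detach flow (the IDs are placeholders and the final status check is my own addition, not part of the patch):

  # placeholders for the workload server and its attached volume
  server_workload_id=<workload-server-uuid>
  vol_id=<attached-volume-uuid>

  # detach the volume before the overcloud reboot (argument order: server first, then volume)
  openstack server remove volume $server_workload_id $vol_id

  # confirm the volume went back to 'available' rather than 'error' before rebooting
  openstack volume show $vol_id -f value -c status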

This result reproduces again and again, and I believe it is the source of the problem here:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/71/artifact/.sh/ir-tripleo-overcloud-reboot.log
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/70/artifact/.sh/ir-tripleo-overcloud-reboot.log

Seeing as the previous rabbitmq/neutron post-reboot issue is no longer the prime suspect here (this is happening before the OC reboot), I'm changing
the direction of this bug towards the errored volume after the OC upgrade.

Comment 31 pkomarov 2019-01-22 12:26:25 UTC
Following further testing I'm resetting the original subject for this bug; the former issue may be a side effect.
I'm currently creating automation to reproduce and test this $title scenario.

Comment 32 Raviv Bar-Tal 2019-01-22 13:07:55 UTC
This bug is reproduced in all our automation jobs.
The reboot procedure is performed according to the official Red Hat documentation (a sketch of the sequence is given below).
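
For context, a minimal sketch of the rolling reboot sequence that procedure follows (illustrative only, assuming pacemaker-managed controllers; not a verbatim copy of the documentation):

  # repeated for each controller in turn
  sudo pcs cluster stop <controller-hostname>   # take the node out of the pacemaker cluster
  sudo reboot
  # once the node is back up
  sudo pcs cluster start <controller-hostname>
  sudo pcs status   # wait until all resources report Started before rebooting the next controller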

Comment 33 Raviv Bar-Tal 2019-02-03 07:08:45 UTC
Hi Pini, all,
We saw the bug reproduced with a healthy working volume and with no volume at all;
this bug should not be blocked by or depend on BZ1668568.

Comment 34 Takashi Kajinami 2019-03-25 00:23:56 UTC
Hi,

Any updates on this case?
I'm hitting the same bug in RHOSP13, reported as BZ1592528,
and would like to have a fix for RHOSP13 as well.

I checked the attached URL, but I cannot find a clear reason
why changing the yaml in question would be a solution here, given that the problem
is happening in RabbitMQ.

Thank you,

Comment 35 Eran Kuris 2019-03-31 09:13:51 UTC
Guys, we really need to get a fix for this issue.
It is affecting all update/upgrade jobs.
I am adding an sos report of my setup with the same problem:
http://rhos-release.virt.bos.redhat.com/log/bz1661806

