Description of problem:
After overcloud reboot, nova fails to spawn new VMs with this error:

containers/nova/nova-compute.log:2018-12-23 13:24:07.381 1 ERROR nova.compute.manager [req-a204ffc7-5cc6-4efe-8bce-4691fb7bd6d7 dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] [instance: f29a6e8f-90bd-498a-9bbd-41bb8be2f598] Instance failed to spawn: PortBindingFailed: Binding failed for port 63cd0f42-c0c5-4bad-9e29-528314e98236, please check neutron logs for more information.

In the neutron log I can see it reconnecting to rabbitmq:

2018-12-23 13:24:08.832 26 INFO neutron.wsgi [req-507d735e-ed17-4ba3-be71-7f5a2d5ca4e3 dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] 172.17.1.36 "GET /v2.0/ports?device_id=f29a6e8f-90bd-498a-9bbd-41bb8be2f598 HTTP/1.1" status: 200 len: 186 time: 0.0640490
2018-12-23 13:24:09.796 30 INFO oslo.messaging._drivers.impl_rabbit [req-a9264b1f-51a2-4577-9e11-c25261b41c10 - - - - -] [7cbdbdfb-c6a6-4430-a5b7-3a26043170ed] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 52404.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install the OSP14 RC puddle
2. Reboot the overcloud (ir reboot)
3. Try to spawn a new VM (server create)

Actual results:
Creation of the new VM fails

Expected results:

Additional info:
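For reference, a minimal reproduction of step 3 along these lines (image, flavor and network names below are placeholders, not taken from the failing environment):

  # on the undercloud, with the overcloud rc sourced
  source ~/overcloudrc
  # pick any existing image/flavor/network; --wait makes the failure visible immediately
  openstack server create --image cirros-0.3.5-x86_64-disk.img \
      --flavor m1.micro --network internal_net --wait test-vm-after-reboot
  # on failure, check the port binding and the nova/neutron logs referenced above
  openstack port list --server test-vm-after-reboot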
Created attachment 1516364 [details] nova logs
Created attachment 1516365 [details] rabbitmq log
Created attachment 1516366 [details] neutron log
More logs can be found at: http://ikook.tlv.redhat.com/uploads/update/osp14/1661806/
I downloaded logs and checked that rabbit bundle is not able to start after the reboot: Controller 0: Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ DIAGNOSTICS ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ =========== ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-0'] ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ rabbit@controller-0: ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * connected to epmd (port 4369) on controller-0 ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * epmd reports: node 'rabbit' not running at all ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ no other nodes on controller-0 ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * suggestion: start the node ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ current node details: ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-59@controller-0' ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ] Dec 23 12:13:20 [12] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:26720ms queue-time:0ms Dec 23 12:13:30 [12] controller-0 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:31 Dec 23 12:13:31 [12] controller-0 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:31 pid:2267 exit-code:0 exec-time:761ms queue-time:0ms Controller 1: Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ 
DIAGNOSTICS ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ =========== ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-1'] ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ rabbit@controller-1: ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * connected to epmd (port 4369) on controller-1 ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * epmd reports: node 'rabbit' not running at all ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ no other nodes on controller-1 ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * suggestion: start the node ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ current node details: ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-85@controller-1' ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ] Dec 23 12:06:52 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:23795ms queue-time:0ms Dec 23 12:07:03 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:32 Dec 23 12:07:04 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:32 pid:2120 exit-code:0 exec-time:728ms queue-time:0ms Dec 23 12:12:16 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:286 Dec 23 12:12:17 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:286 pid:11269 exit-code:0 exec-time:699ms queue-time:0ms Dec 23 12:12:33 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:300 Dec 23 12:12:34 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:300 pid:11750 exit-code:0 exec-time:732ms queue-time:0ms Dec 23 12:12:38 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:301 Dec 23 12:12:45 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq 
action:notify call_id:301 pid:12236 exit-code:0 exec-time:7251ms queue-time:0ms Dec 23 12:12:49 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:316 Dec 23 12:12:49 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:316 pid:12244 exit-code:0 exec-time:715ms queue-time:0ms Dec 23 12:13:31 [12] controller-1 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:346 Dec 23 12:13:35 [12] controller-1 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:346 pid:13696 exit-code:0 exec-time:3485ms queue-time:0ms Dec 23 12:17:43 [12] controller-1 pacemaker_remoted: warning: qb_ipcs_event_sendv: new_event_notification (12-20924-17): Broken pipe (32) Dec 23 12:17:43 [12] controller-1 pacemaker_remoted: warning: send_client_notify: Could not notify client proxy-cib_rw-20924-63f76b1f/63f76b1f-d917-4835-9664-56e4de96241 Controller 2: Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-2'] ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ rabbit@controller-2: ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * connected to epmd (port 4369) on controller-2 ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * epmd reports: node 'rabbit' not running at all ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ no other nodes on controller-2 ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * suggestion: start the node ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ current node details: ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-46@controller-2' ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: unable to connect to node 'rabbit@controller-2': no dedown ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ DIAGNOSTICS ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ =========== ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: 
operation_finished: rabbitmq_start_0:183:stderr [ attempted to contact: ['rabbit@controller-2'] ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ rabbit@controller-2: ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * connected to epmd (port 4369) on controller-2 ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * epmd reports: node 'rabbit' not running at all ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ no other nodes on controller-2 ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ * suggestion: start the node ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ current node details: ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - node name: 'rabbitmq-cli-48@controller-2' ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - home dir: /var/lib/rabbitmq ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ - cookie hash: kvePeYTNt/0LxwG/36Id9Q== ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: notice: operation_finished: rabbitmq_start_0:183:stderr [ Error: {not_a_cluster_node,"The node selected is not in the cluster."} ] Dec 23 12:02:32 [12] controller-2 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:start call_id:13 pid:183 exit-code:0 exec-time:24863ms queue-time:0ms Dec 23 12:02:44 [12] controller-2 pacemaker_remoted: info: log_execute: executing - rsc:rabbitmq action:notify call_id:32 Dec 23 12:02:45 [12] controller-2 pacemaker_remoted: info: log_finished: finished - rsc:rabbitmq action:notify call_id:32 pid:2119 exit-code:0 exec-time:728ms queue-time:0ms Not sure if this may be a duplicate of: https://bugzilla.redhat.com/show_bug.cgi?id=1656368 / https://bugzilla.redhat.com/show_bug.cgi?id=1656368
I see:

1) on controller-2:

2018-12-23 12:47:33.306 1 INFO nova.compute.manager [req-577013c5-4241-4fb6-9204-0e1673529906 - - - - -] [instance: 494c82a5-9284-44c4-b1dc-1baa39791e96] During sync_power_state the instance has a pending task (spawning). Skip.
2018-12-23 12:47:46.244 1 ERROR oslo.messaging._drivers.impl_rabbit [req-c1b0f7cb-d8fa-45e1-9684-a36f224542fe - - - - -] [2cd23e8d-610b-483a-bc08-ce00c4f49bd8] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out

2) on compute-0:

/var/log/containers/nova/nova-compute.log.1:2018-12-23 12:49:30.056 1 ERROR nova.compute.manager [req-583fa0df-11f0-48e0-8a19-57d6cac7b88f dd860c0f95444087a00c2d79a1975c9c fd31f023187142b596a6fa8abbe229a1 - default default] [instance: b0de420d-a108-4ecb-82b3-d01407f3423b] Instance failed to spawn: VirtualInterfaceCreateException: Virtual Interface creation failed

So in 1) I see the instance failing to recover from a reboot, and then in 2) I see that shortly afterwards the virtual interface for that instance can't be created. Maybe this is a race condition where the compute tries to start up a VM after a reboot but the network infrastructure is not all up yet...
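If it is that kind of race, a quick way to check whether the network side is actually up before retrying could look like this (a rough sketch run from the undercloud, not part of the original report):

  source ~/overcloudrc
  # all OVS/DHCP/L3 agents should report alive and State=UP before spawning VMs
  openstack network agent list
  # and the rabbit/galera/redis bundles should all be Started on the controllers
  ssh heat-admin@controller-0 sudo pcs status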
Note that after some time rabbitmq does manage to reconnect:

2018-12-23 14:08:27.945 1 INFO oslo.messaging._drivers.impl_rabbit [req-c1b0f7cb-d8fa-45e1-9684-a36f224542fe - - - - -] [2cd23e8d-610b-483a-bc08-ce00c4f49bd8] Reconnected to AMQP server on controller-2.internalapi.localdomain:5672 via [amqp] client with port 47646.
2018-12-23 14:08:52.451 1 ERROR oslo.messaging._drivers.impl_rabbit [req-dbaaf87b-1878-4e65-b6f1-1fa9a4ec723a 4c9abe3ffad741108895b6b01035b0b4 2f06dfd71b464c10a004b3c6eb3137d9 - default default] [872ee446-5c90-4cec-aa10-7a7c01bbc7ce] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out
2018-12-23 14:08:53.466 1 INFO oslo.messaging._drivers.impl_rabbit [req-dbaaf87b-1878-4e65-b6f1-1fa9a4ec723a 4c9abe3ffad741108895b6b01035b0b4 2f06dfd71b464c10a004b3c6eb3137d9 - default default] [872ee446-5c90-4cec-aa10-7a7c01bbc7ce] Reconnected to AMQP server on controller-0.internalapi.localdomain:5672 via [amqp] client with port 37638.
Right on, I saw that too. Still, 2 hours after reboot (not sure if the times in the logs are in sync, though) looks like way too much. @Roee or anybody, can you check if you can boot a server now? If it's just a matter of a few minutes for the cluster to be brought up, I'd perhaps remove the blocker flag and document it. If it's a different issue, it'd be nice to confirm whether it has to do with the resource-agents version.

I'm trying on my OSP14 setup, which has the 'problematic' version inside the bundle container:

[root@controller-2 ~]# rpm -qa|grep resource-agents
resource-agents-4.1.1-12.el7_6.6.x86_64
[root@controller-2 ~]# docker ps|grep rabbit
754b65a943ec 192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest "/bin/bash /usr/lo..." 6 days ago Up 6 days rabbitmq-bundle-docker-2
[root@controller-2 ~]# docker exec -uroot -it 754b65a943ec bash
()[root@controller-2 /]# rpm -qa|grep resource-agent
resource-agents-4.1.1-12.el7_6.4.x86_64

After restarting all controllers I can see that the rabbit cluster is healthy:

Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest]
rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1
rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2

[root@controller-2 neutron]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.6 (Maipo)
[root@controller-2 neutron]# uname -a
Linux controller-2 3.10.0-957.1.3.el7.x86_64 #1 SMP Thu Nov 15 17:36:42 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

I also saw that there's a new version of resource-agents available:

Resolving Dependencies
--> Running transaction check
---> Package resource-agents.x86_64 0:4.1.1-12.el7_6.6 will be updated
---> Package resource-agents.x86_64 0:4.1.1-12.el7_6.7 will be an update
--> Finished Dependency Resolution
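To compare the host and in-container resource-agents versions on all controllers at once, something along these lines should work from the undercloud (the "controller" group name and the rabbitmq-bundle container name are assumptions based on this environment's inventory):

  source ~/stackrc
  ansible controller -b -m shell -a \
    'rpm -q resource-agents; \
     docker exec $(docker ps -qf name=rabbitmq-bundle) rpm -q resource-agents'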
Raviv, can you elaborate on:

Steps to Reproduce:
2. Reboot the overcloud (ir reboot) -> does this mean "ir tripleo-overcloud --postreboot: true"? If not, what was the way you rebooted the overcloud?
3. Try to spawn new VM - (server create) -> what was the time frame here? How much time passed after the reboots before you tried to build an instance?

Thanks.
Hi Pini,
1. This is the full command that I use for the reboot:
infrared tripleo-overcloud -o overcloud-reboot.yml --postreboot True --postreboot_evacuate yes --deployment-files composable_roles --overcloud-stack overcloud -e @cleanup_services.ymlt
2. I waited 10-15 minutes before trying to spawn a new server.
OK, so "infrared tripleo-overcloud --postreboot True":
https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml
does a lot of additional actions besides the reboot of the overcloud. Basically, for each node in overcloud_nodes it:
- checks if there are active instances to be migrated
- checks the overall overcloud stable state
- sets ceph to rebalance=off
- shuts down the pacemaker cluster on each controller with --force and --timeout=5m
- reboots the whole overcloud (here: 3 controllers, 1 compute, 3 ceph)
- waits until the node is back up
- waits until pacemaker is up (in case it's a controller)
- does ceph health checks
- migrates any previous instances

So my problem with this scenario (pacemaker-wise) is with "shuts down pacemaker cluster on each controller with --force and --timeout=5m": is --force necessary? And why not do "pcs cluster stop --all" on one controller so that the pacemaker shutdown is nice and synchronised? This is a soft-reboot testing scenario, where all services are checked and handled gracefully before and after, according to the official documentation:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/14-beta/html/director_installation_and_usage/sect-rebooting_the_overcloud#rebooting_controller_nodes

Since "infrared tripleo-overcloud --postreboot True" does the pacemaker shutdown not according to the documentation, I would repeat this test with two scenario variations (see the sketch below):
1) reboot only the controller and compute nodes
2) reboot the whole overcloud
Both manually, with "pcs cluster stop --all", on the condition that the command executes successfully. I'll try to reproduce it myself; feel free to as well.
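A rough sketch of that manual variant as I understand it (hostnames are this environment's; treat the exact sequence as an assumption, the authoritative steps are in the documentation linked above):

  # on one controller: stop the whole pacemaker cluster cleanly
  ssh heat-admin@controller-0 sudo pcs cluster stop --all
  # reboot the controllers (and computes for variant 1)
  for node in controller-0 controller-1 controller-2; do
      ssh heat-admin@$node sudo reboot
  done
  # once the nodes are back, start the cluster from any controller and verify
  ssh heat-admin@controller-0 sudo pcs cluster start --all
  ssh heat-admin@controller-0 sudo pcs status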
BTW it reproduces on OVN deployment too.
Issue reproduces; no need for the full restart scenario. Reproduce actions:
1) choose one controller, do pcs cluster stop
2) wait
3) then reboot
4) wait, and see that the resources in Stopped mode are: rabbitmq-bundle-1, galera-bundle-1, redis-bundle-1, haproxy-bundle-docker-1

So this is a bundle issue, and not a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1656368, since the fixes from that bug are already present here:
http://pastebin.test.redhat.com/688426

[stack@undercloud-0 ~]$ ansible controller-1 -mshell -b -a'rpm -qa|grep resource-agents'
controller-1 | SUCCESS | rc=0 >>
resource-agents-4.1.1-12.el7_6.7.x86_64
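For reference, roughly the command sequence for that reproducer (controller-1 is just the example node used here):

  # on the controller being tested: stop the local cluster services
  ssh heat-admin@controller-1 sudo pcs cluster stop
  # wait for the stop to complete, then reboot the node
  ssh heat-admin@controller-1 sudo reboot
  # after it comes back, check which bundle resources stayed Stopped on it
  ssh heat-admin@controller-0 sudo pcs status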
#checking using resource debug-start:
#pcs resource debug-start rabbitmq --full => http://pastebin.test.redhat.com/688460

From the output and the resource-agent rabbitmq-cluster flow: the resource starts -> rmq_try_start() -> rmq_monitor() ...

 > stderr: + 13:51:39: rmq_monitor:200: /usr/sbin/rabbitmqctl eval 'rabbit_mnesia:cluster_status_from_mnesia().'
 > stderr: /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster: line 200: /usr/sbin/rabbitmqctl: No such file or directory
 > stderr: + 13:51:39: __ha_log:248: echo 'rabbitmq-cluster(rabbitmq)[434749]: Dec' 27 13:51:39 'INFO: RabbitMQ server could not get cluster status from mnesia'

and it all goes downhill from there..
#guessing this is what's missing (it provides rabbitmqctl):
yum install -y rabbitmq-server-3.6.16-1.el7ost.noarch
(the previous error goes away)

But still, the node's bundles don't survive a reboot; they do however survive pcs cluster stop && pcs cluster start. Maybe recent fixes around this have something to do with it:

(heartbeat/rabbitmq-cluster) rmq_delete_nodename()
...
${HA_SBIN_DIR}/crm_attribute -N $NODENAME -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -D
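Since the service actually runs inside the bundle container, a quick sanity check of where rabbitmqctl lives and what the cluster thinks of itself could look like this (a sketch; the container name filter matches the docker ps output earlier in this bug):

  # on the host: debug-start runs the agent here, so rabbitmqctl must also exist on the host
  which rabbitmqctl || echo "rabbitmqctl missing on host"
  # inside the running bundle container, where rabbit itself runs
  docker exec $(docker ps -qf name=rabbitmq-bundle) which rabbitmqctl
  docker exec $(docker ps -qf name=rabbitmq-bundle) rabbitmqctl cluster_status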
update : doing "pcs resource refresh --force" or "pcs resource cleanup" seems to clear up the error count so that the bundles start one by one before the following was logged in corosync.log : #after reboot : Dec 28 03:59:09 [594832] controller-0 pengine: info: determine_online_status: Node controller-1 is online Dec 28 03:59:09 [594832] controller-0 pengine: warning: unpack_rsc_op_failure: Processing failed start of rabbitmq-bundle-docker-1 on controller-1: unknown error | rc=1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: unpack_rsc_op_failure: Processing failed start of galera-bundle-docker-1 on controller-1: unknown error | rc=1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: unpack_rsc_op_failure: Processing failed start of redis-bundle-docker-1 on controller-1: unknown error | rc=1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: unpack_rsc_op_failure: Processing failed start of haproxy-bundle-docker-1 on controller-1: unknown error | rc=1 Dec 28 03:59:09 [594832] controller-0 pengine: info: pe_get_failcount: Container rabbitmq-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: check_migration_threshold: Forcing rabbitmq-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000) Dec 28 03:59:09 [594832] controller-0 pengine: info: pe_get_failcount: Container galera-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: check_migration_threshold: Forcing galera-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000) Dec 28 03:59:09 [594832] controller-0 pengine: info: pe_get_failcount: Container redis-bundle-docker-1 and the resources within it have failed INFINITY times on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: check_migration_threshold: Forcing redis-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000) Dec 28 03:59:09 [594832] controller-0 pengine: info: pe_get_failcount: haproxy-bundle-docker-1 has failed INFINITY times on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: warning: check_migration_threshold: Forcing haproxy-bundle-docker-1 away from controller-1 after 1000000 failures (max=1000000) Dec 28 03:59:09 [594832] controller-0 pengine: info: RecurringOp: Start recurring monitor (60s) for rabbitmq-bundle-1 on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: info: RecurringOp: Start recurring monitor (60s) for galera-bundle-1 on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: info: RecurringOp: Start recurring monitor (60s) for redis-bundle-1 on controller-1 Dec 28 03:59:09 [594832] controller-0 pengine: notice: LogAction: * Start rabbitmq-bundle-1 ( controller-1 ) due to unrunnable rabbitmq-bundle-docker-1 start (blocked) Dec 28 03:59:09 [594832] controller-0 pengine: notice: LogAction: * Start galera-bundle-1 ( controller-1 ) due to unrunnable galera-bundle-docker-1 start (blocked) Dec 28 03:59:09 [594832] controller-0 pengine: notice: LogAction: * Start redis-bundle-1 ( controller-1 ) due to unrunnable redis-bundle-docker-1 start (blocked)
Update: so after a reboot, all of a specific node's bundles have a fail count, which then leads pacemaker to force the bundle resources away from their controller.

Workaround:
pcs resource refresh --force --node controller-1
This way the bundled resources' failures are rechecked and the bundles come back online on controller-1.

Two issues here:
- why is the resource refresh behaviour not handled on a controller reboot, so that a correct reprobe of the resources is done on boot?
- the resource max-failures threshold is reached, but then it's not handled by any crm reprobe action?
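For anyone hitting this, roughly how to see and clear the stale fail counts (a sketch; the resource and node names match this environment):

  # show the per-node fail counts that are forcing the bundles away
  pcs resource failcount show rabbitmq-bundle-docker-1
  pcs resource failcount show galera-bundle-docker-1
  # clear them and reprobe so the bundles are allowed back on the rebooted controller
  pcs resource refresh --force --node controller-1
  # (pcs resource cleanup <resource> also clears the failcount for a single resource)
  pcs status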
Not reproducible in the latest passed_phase2 puddle; fixed in core_puddle_version: 2018-12-20.4

#testing restart of pacemaker and bundles: http://pastebin.test.redhat.com/689808
#testing of booting a nova instance after reboot of a controller: http://pastebin.test.redhat.com/689810
Thanks Pini for taking care of this! Do you happen to know what the actual fix has been?
I can't pinpoint the exact fix, and since the reproducer is manual it is difficult to establish a valid testing baseline here. I've been monitoring pacemaker's bundles (rabbit, etc.) and the corosync logs across reboots and multiple scenarios. From what I could see, pacemaker did not recheck the bundles' state and considered them still not eligible to run; in later testing this was followed by pacemaker's monitoring operations kicking in and restarting the bundles.
The bug was just reproduced on titan38; you are welcome to check it.
#from rabbitmq logs: =WARNING REPORT==== 3-Jan-2019::09:27:20 === closing AMQP connection <0.25933.15> (172.17.1.11:35978 -> 172.17.1.39:5672 - neutron-server:35:64ccbfd3-f02f-4661-b80e-a910227423a1, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection #from neytron server logs: <- alot of reconnects 2019-01-03 09:30:54.000 31 ERROR oslo.messaging._drivers.impl_rabbit [req-d3576839-62fa-46dc-b261-a471437fb914 - - - - -] [dd6e5a63-fb1a-47d1-b613-471e6db192cd] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out #rabbit logs show alot of : <- these maybe related to a reboot =ERROR REPORT==== 3-Jan-2019::09:00:40 === Discarding message {'$gen_cast',{deliver,{delivery,false,true,<0.2096.15>,{basic_message,{resource,<<"/">>,exchange,<<"q-server-resource-versions_fanout">>},[<<>>],{content,60,{'P_basic',<<"application/json">>,<<"utf-8">>,[],2,0,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},<<248,0,16,97,112,112,108,105,99,97,116,105,111,110,47,106,115,111,110,5,117,116,102,45,56,0,0,0,0,2,0>>,rabbit_framing_amqp_0_9_1,[<<"{\"oslo.message\": \"{\\"_context_domain\\": null, \\"_context_request_id\\": \\"req-d3576839-62fa-46dc-b261-a471437fb914\\", \\"_context_global_request_id\\": null, \\"_context_auth_token\\": null, \\"_context_resource_uuid\\": null, \\"_context_tenant_name\\": null, \\"_context_user\\": null, \\"_context_user_id\\": null, \\"_context_system_scope\\": null, \\"_context_show_deleted\\": false, \\"_context_is_admin\\": true, \\"version\\": \\"1.0\\", \\"_context_project_domain\\": null, \\"_context_timestamp\\": \\"2019-01-02 22:36:56.666997\\", \\"method\\": \\"report_agent_resource_versions\\", \\"_context_project\\": null, \\"_context_roles\\": [], \\"args\\": {\\"version_map\\": {\\"Subnet\\": \\"1.0\\", \\"Network\\": \\"1.0\\", \\"PortForwarding\\": \\"1.0\\", \\"SubPort\\": \\"1.0\\", \\"SecurityGroup\\": \\"1.0\\", \\"SecurityGroupRule\\": \\"1.0\\", \\"Trunk\\": \\"1.1\\", \\"QosPolicy\\": \\"1.7\\", \\"Port\\": \\"1.4\\", \\"Log\\": \\"1.0\\"}, \\"agent_type\\": \\"Open vSwitch agent\\", \\"agent_host\\": \\"controller-1.localdomain\\"}, \\"_unique_id\\": \\"ac1c4525959f4a53b636b01611e4dc0a\\", \\"_context_tenant_id\\": null, \\"_context_is_admin_project\\": true, \\"_context_project_name\\": null, \\"_context_user_identity\\": \\"- - - - -\\", \\"_context_tenant\\": null, \\"_context_project_id\\": null, \\"_context_read_only\\": false, \\"_context_user_domain\\": null, \\"_context_user_name\\": null}\", \"oslo.version\": \"2.0\"}">>]},<<55,49,240,202,110,182,239,112,255,54,215,95,92,238,64,243>>,true},1,flow},false}} from <0.2096.15> to <0.3614.0> in an old incarnation (1) of this node (2) #after I did a overcloud pacemaker cluster restart (pcs cluster stop --all&&pcs cluster start --all) Now things seem back to normal #now server creation is succesfull: (overcloud) [stack@undercloud-0 ~]$ openstack server create --flavor m1.micro --image cirros-0.3.5-x86_64-disk.img --network 5104edf3-b3f3-4adc-ad2a-124336adc3fa osvm2 --wait +-------------------------------------+---------------------------------------------------------------------+ | Field | Value | +-------------------------------------+---------------------------------------------------------------------+ | OS-DCF:diskConfig | MANUAL | | OS-EXT-AZ:availability_zone | nova | | OS-EXT-SRV-ATTR:host | compute-1.localdomain | | OS-EXT-SRV-ATTR:hypervisor_hostname | 
compute-1.localdomain | | OS-EXT-SRV-ATTR:instance_name | instance-00000301 | | OS-EXT-STS:power_state | Running | | OS-EXT-STS:task_state | None | | OS-EXT-STS:vm_state | active | | OS-SRV-USG:launched_at | 2019-01-03T09:43:33.000000 | | OS-SRV-USG:terminated_at | None | | accessIPv4 | | | accessIPv6 | | | addresses | internal_net_b326fd44c2=192.168.0.14 | | adminPass | 98YDt4LTeD5y | | config_drive | | | created | 2019-01-03T09:40:18Z | | flavor | m1.micro (a2c6130a-5f98-4038-8914-5e75b8b873ea) | | hostId | 9af54f0512fbc853856e62f4eaabbc049bfb186a33f2139f1f5cfcde | | id | 9cad3922-7886-48e0-87b1-fc73000ae02b | | image | cirros-0.3.5-x86_64-disk.img (60c3bd60-8309-421e-8a16-e7f9af7219e3) | | key_name | None | | name | osvm2 | | progress | 0 | | project_id | 851bf1c1256b438191e0516ed551813d | | properties | | | security_groups | name='default' | | status | ACTIVE | | updated | 2019-01-03T09:43:33Z | | user_id | 8940592d97c44117a685ab75d702855d | | volumes_attached | | +-------------------------------------+---------------------------------------------------------------------+
After manually reproducing the overcloud reboot steps in:
https://github.com/redhat-openstack/infrared/blob/master/plugins/tripleo-overcloud/overcloud_reboot.yml
I can say that, done manually, I do not see the errors that are present after that playbook is run. I did "pcs cluster stop" on each controller in turn and made sure it came back up with pacemaker healthy, then live-migrated each compute's instances with nova and rebooted it. All actions were successful; all pacemaker resources and all instances are Active and healthy after the manual procedure. In addition, the step where the ansible playbook failed passed for me (in the action logs).

Manual actions and results log: http://pastebin.test.redhat.com/690459
sos reports during the procedure: http://rhos-release.virt.bos.redhat.com/log/pkomarov_sosreports/BZ1661806/

My understanding is that this is an automation error (blaming the overcloud_reboot ansible playbook) rather than a bug, but to be sure, Raviv, can you run the ir reboot procedure a couple of times and report the results?
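Roughly what the manual compute part above amounts to (a sketch only; instance/host names are placeholders and the exact live-migration call may differ depending on the client version in use):

  source ~/overcloudrc
  # list instances on the compute about to be rebooted
  openstack server list --all-projects --host compute-0.localdomain
  # live migrate each one off (scheduler picks the target host)
  nova live-migration <instance-id>
  # once the compute is empty, reboot it and wait for nova-compute to report up again
  ssh heat-admin@compute-0 sudo reboot
  openstack compute service list --service nova-compute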
The IR reboot command failed again in the same place and with the same error. That been said, the system is not functional and all services are constantly reconnecting to AMQP server. =WARNING REPORT==== 7-Jan-2019::09:01:23 === closing AMQP connection <0.25239.5059> (172.17.1.19:42468 -> 172.17.1.11:5672 - neutron-server:33:57e91aed-bbb0-4ee8-9760-66a2e1f5d94f, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection =WARNING REPORT==== 7-Jan-2019::09:01:23 === closing AMQP connection <0.4118.4994> (172.17.1.19:42466 -> 172.17.1.11:5672 - neutron-server:33:e0c86681-7785-4b37-a496-4183a1f16a16, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection =WARNING REPORT==== 7-Jan-2019::09:01:23 === closing AMQP connection <0.10045.5532> (172.17.1.19:42476 -> 172.17.1.11:5672 - neutron-server:32:f428ac86-a9b0-492c-a606-435642ff9a79, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection =WARNING REPORT==== 7-Jan-2019::09:01:23 === closing AMQP connection <0.1599.4801> (172.17.1.19:42486 -> 172.17.1.11:5672 - neutron-server:34:ecb32873-5d9f-4574-ac0b-998e1e31afdb, vhost: '/', user: 'guest'): client unexpectedly closed TCP connection =INFO REPORT==== 7-Jan-2019::09:01:24 === accepting AMQP connection <0.9040.4835> (172.17.1.19:55848 -> 172.17.1.11:5672) pcs status show no error: [root@controller-2 ~]# pcs status Cluster name: tripleo_cluster Stack: corosync Current DC: controller-2 (version 1.1.19-8.el7_6.2-c3c624ea3d) - partition with quorum Last updated: Mon Jan 7 09:28:29 2019 Last change: Thu Jan 3 15:47:05 2019 by redis-bundle-1 via crm_attribute on controller-1 12 nodes configured 38 resources configured Online: [ controller-0 controller-1 controller-2 ] GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-1 galera-bundle-2@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-1 rabbitmq$bundle-2@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-1 redis-bundle-2@controller-2 ] Full list of resources: Docker container set: rabbitmq-bundle [192.168.24.1:8787/rhosp14/openstack-rabbitmq:pcmklatest] rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0 rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-1 rabbitmq-bundle-2 (ocf::heartbeat:rabbitmq-cluster): Started controller-2 Docker container set: galera-bundle [192.168.24.1:8787/rhosp14/openstack-mariadb:pcmklatest] galera-bundle-0 (ocf::heartbeat:galera): Master controller-0 galera-bundle-1 (ocf::heartbeat:galera): Master controller-1 galera-bundle-2 (ocf::heartbeat:galera): Master controller-2 Docker container set: redis-bundle [192.168.24.1:8787/rhosp14/openstack-redis:pcmklatest] redis-bundle-0 (ocf::heartbeat:redis): Slave controller-0 redis-bundle-1 (ocf::heartbeat:redis): Master controller-1 redis-bundle-2 (ocf::heartbeat:redis): Slave controller-2 ip-192.168.24.14 (ocf::heartbeat:IPaddr2): Started controller-2 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-1 ip-172.17.1.13 (ocf::heartbeat:IPaddr2): Started controller-2 ip-172.17.1.25 (ocf::heartbeat:IPaddr2): Started controller-2 ip-172.17.3.11 (ocf::heartbeat:IPaddr2): Started controller-1 ip-172.17.4.16 (ocf::heartbeat:IPaddr2): Started controller-2 Docker container set: haproxy-bundle [192.168.24.1:8787/rhosp14/openstack-haproxy:pcmklatest] haproxy-bundle-docker-0 (ocf::heartbeat:docker): Started controller-0 haproxy-bundle-docker-1 (ocf::heartbeat:docker): Started controller-1 haproxy-bundle-docker-2 (ocf::heartbeat:docker): Started 
controller-2 Docker container: openstack-cinder-volume [192.168.24.1:8787/rhosp14/openstack-cinder-volume:pcmklatest] openstack-cinder-volume-docker-0 (ocf::heartbeat:docker): Started controller-1 Docker container: openstack-cinder-backup [192.168.24.1:8787/rhosp14/openstack-cinder-backup:pcmklatest] openstack-cinder-backup-docker-0 (ocf::heartbeat:docker): Started controller-1 Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled
Seeing that env, on the neutron and rabbitmq : (undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'grep ERROR /var/log/containers/neutron/*|tail' [WARNING]: Found both group and host with same name: undercloud compute-1 | SUCCESS | rc=0 >> /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout, retry=retry) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 573, in _send /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent call_monitor_timeout) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in wait /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent message = self.waiters.get(msg_id, timeout=timeout) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 336, in get /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent 'to message ID %s' % msg_id) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent MessagingTimeout: Timed out waiting for a reply to message ID dc939b8e46794bdfa8300a90da45cffa /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:01:02.492 7553 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent compute-0 | SUCCESS | rc=0 >> /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to process message ... skipping it.: DuplicateMessageError: Found duplicate message(f7e5c65cb22b4238a2be9016fa2f9e5c). Skipping it. 
/var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last): /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 361, in _callback /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit self.callback(RabbitMessage(message)) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 251, in __call__ /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit unique_id = self.msg_id_cache.check_duplicate_message(message) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqp.py", line 122, in check_duplicate_message /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit raise rpc_common.DuplicateMessageError(msg_id=msg_id) /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit DuplicateMessageError: Found duplicate message(f7e5c65cb22b4238a2be9016fa2f9e5c). Skipping it. /var/log/containers/neutron/openvswitch-agent.log.1:2019-01-07 10:00:34.243 8165 ERROR oslo.messaging._drivers.impl_rabbit controller-0 | SUCCESS | rc=0 >> /var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.041 33 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [2011ceb7-70d9-4b47-9ca0-293e7f203623] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.066 31 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [fffac77f-a0e8-4b80-bcd2-ffd3f76c80ea] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:17.716 34 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [9add6b67-992c-4fc1-b332-9392e1397f1b] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:17.731 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [03c8ae06-0fae-4a57-928a-cc9819f81f5e] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:18.164 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [d4298671-2063-46fe-a0c7-e64592d67878] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. 
Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:19.955 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [41168b13-0683-4986-8d69-c02fc11c6d38] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.120 35 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [3c9b58b2-0daf-48c7-82c5-c26490679d30] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.421 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [41950f6c-0346-42a3-bbe5-17599ee47c27] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.432 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [98042e42-8390-4a0e-808e-3979084f0aa6] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:20.778 35 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [b1be6646-09e0-4932-a664-3f27957f12ed] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out controller-1 | SUCCESS | rc=0 >> /var/log/containers/neutron/server.log.1:2019-01-07 11:00:23.669 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [e8f33ad0-b4d8-4371-af3a-e7c140f387ba] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:24.281 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [beb0bdf4-dad1-46df-91b1-93f0a8308e8f] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.254 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [06f8beac-c9f7-4ca9-9d7f-d24ce59d17ac] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.718 32 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [77e6b69d-9036-4611-b6ef-88d2477b977d] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:25.770 32 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [1ae40b83-e99a-417f-9839-b47051088172] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. 
Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:26.190 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [b87d6ddd-7d88-4b20-8ef0-2f6487a3c667] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.007 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [d8854703-a6f6-46b9-88f9-98fb364082d9] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.173 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [16dd3d75-464f-4df0-acaf-7fc7f6631a9b] AMQP server on controller-0.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:27.382 34 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [0ae351cf-2e12-4504-b1b7-ae6a49799f2e] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:00:28.744 33 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [e0c86681-7785-4b37-a496-4183a1f16a16] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out controller-2 | SUCCESS | rc=0 >> /var/log/containers/neutron/server.log.1:2019-01-07 11:01:11.822 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [097f69c1-cd44-4fa4-a475-5e364738aa7c] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:11.926 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [95687baa-4423-46b3-8469-26a40ca1f6a3] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:12.200 32 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [30c3c84e-4725-453e-9f2d-6112f2544a66] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:12.774 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [a2dd661d-bd30-4a0c-9c3d-88af83f89fe2] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:14.945 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [7896a37d-a872-4db3-bbea-ef256351c868] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. 
Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.193 30 ERROR oslo.messaging._drivers.impl_rabbit [req-02b987ef-7648-4fb3-9909-daf3e1cd5a32 - - - - -] [99c8b542-c0e7-4b2e-b8c7-3c88ad7cc710] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.326 30 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [7adfc2f1-ea84-441e-8788-8212b858c95e] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:15.826 31 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [e1261124-0fa6-43db-be24-c699ff07251e] AMQP server on controller-1.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.430 33 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [1e4ff701-48dc-4c90-80a0-4b06dd7d42e6] AMQP server on controller-2.internalapi.localdomain:5672 is unreachable: timed out. Trying again in 1 seconds.: timeout: timed out /var/log/containers/neutron/server.log.1:2019-01-07 11:01:16.481 34 ERROR oslo.messaging._drivers.impl_rabbit [req-7b26fc95-5888-4717-a232-551f2830eaa1 - - - - -] [f1cb0285-5edc-4297-912e-9cebc15acf30] AMQP server controller-2.internalapi.localdomain:5672 closed the connection. Check login credentials: Socket closed: IOError: Socket closed ceph-0 | SUCCESS | rc=0 >> grep: /var/log/containers/neutron/*: No such file or directory ceph-1 | SUCCESS | rc=0 >> grep: /var/log/containers/neutron/*: No such file or directory ceph-2 | SUCCESS | rc=0 >> grep: /var/log/containers/neutron/*: No such file or directory (undercloud) [stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a"grep 'closing\|unexpectedly' /var/log/containers/rabbitmq/*|tail" [WARNING]: Found both group and host with same name: undercloud compute-1 | SUCCESS | rc=0 >> grep: /var/log/containers/rabbitmq/*: No such file or directory compute-0 | SUCCESS | rc=0 >> grep: /var/log/containers/rabbitmq/*: No such file or directory controller-0 | SUCCESS | rc=0 >> /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8062.161> (172.17.1.39:40134 -> 172.17.1.39:5672 - neutron-server:33:d4298671-2063-46fe-a0c7-e64592d67878, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8872.161> (172.17.1.11:35398 -> 172.17.1.39:5672 - neutron-server:30:2d250e51-b3d3-4394-9dda-869b5d159a0b, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.9085.161> (172.17.1.11:35420 -> 172.17.1.39:5672 - neutron-server:30:ec96e54a-697c-4b78-a07a-1fce56891653, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.6866.161> (172.17.1.39:40250 -> 172.17.1.39:5672 - neutron-server:35:3c9b58b2-0daf-48c7-82c5-c26490679d30, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection 
/var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.8122.161> (172.17.1.39:40306 -> 172.17.1.39:5672 - neutron-server:35:b1be6646-09e0-4932-a664-3f27957f12ed, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection controller-1 | SUCCESS | rc=0 >> /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.22472.733> (172.17.1.19:42740 -> 172.17.1.19:5672 - neutron-server:33:81d8ed32-3cae-41db-85ec-741fbddf2af9, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.2203.734> (172.17.1.19:43152 -> 172.17.1.19:5672 - neutron-server:32:303ef188-6d24-4d87-ab72-47967e91fd4b, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23214.733> (172.17.1.39:33868 -> 172.17.1.19:5672 - neutron-server:35:ed1a7191-405b-4913-9a80-dffb16241a1e, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.11358.733> (172.17.1.19:43200 -> 172.17.1.19:5672 - neutron-server:31:c30cfaed-5263-464c-88d2-ac7100469ece, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23112.733> (172.17.1.19:43202 -> 172.17.1.19:5672 - neutron-server:31:e8071e69-9ada-4066-b3e3-fd2f93693e23, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection controller-2 | SUCCESS | rc=0 >> /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.25115.4939> (172.17.1.19:48384 -> 172.17.1.11:5672 - neutron-server:32:b87d6ddd-7d88-4b20-8ef0-2f6487a3c667, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.26332.4617> (172.17.1.19:48752 -> 172.17.1.11:5672 - neutron-server:34:14e9a285-1a7d-4593-9c8f-a80121864e95, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.10281.5012> (172.17.1.19:42358 -> 172.17.1.11:5672 - neutron-server:32:03a5ad79-d8c3-4d51-bcd1-4afdac27c38a, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.16716.4442> (172.17.1.39:33098 -> 172.17.1.11:5672 - neutron-server:34:c7b9189d-4d61-49f2-821f-e989c8518dab, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection /var/log/containers/rabbitmq/rabbit.1:closing AMQP connection <0.23174.5374> (172.17.1.11:55596 -> 172.17.1.11:5672 - neutron-server:32:c5fd4df7-7b6b-4e26-9e24-4047944e7e3e, vhost: '/', user: 'guest'): /var/log/containers/rabbitmq/rabbit.1:client unexpectedly closed TCP connection ceph-0 | SUCCESS | rc=0 >> grep: /var/log/containers/rabbitmq/*: No such file or directory ceph-1 | SUCCESS | rc=0 >> grep: /var/log/containers/rabbitmq/*: No such file or directory ceph-2 | SUCCESS | rc=0 >> grep: /var/log/containers/rabbitmq/*: No such file or directory
I was able to reproduce the errors here using the same playbook. After applying the patch below, the cluster came back healthy after the restart playbook.

Patch: https://review.gerrithub.io/439266
Results and cluster health checks after applying the patch: http://pastebin.test.redhat.com/691818

Raviv, can you:
- return the pacemaker cluster to full health with: pcs cluster stop --all && pcs cluster start --all (on one controller)
- test the ir-reboot playbook with the suggested patch? (gerrit_change - 439266)
Thanks
Jenkins jobs are still failing the same way; we need to find a better solution for this problem.
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-14-from-RC-HA-ipv4/9/console
Update: I'm seeing another issue here, and this one occurs before the overcloud reboot. This is from the OVN jobs:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

There is an OC instance created during the overcloud update; this instance has a Cinder volume which is in ERROR state, and that prevents the live migration.

Instance creation and volume attachment happen here (created before the OC reboot):
https://github.com/openstack/tripleo-upgrade/blob/c5038babb5966d3998e807b668c9ecc477175cb0/templates/workload_launch.sh.j2#L211

The attachment error can be seen here:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:21.075 1 ERROR nova.volume.cinder [req-9cc1f5b0-023d-4eca-b071-3d102d2776d4 ec75d0e6c48c4e8e9ee9818e0b88bdf5 8fa2e29ca76f4106955124dd1dea5331 - default default] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Create attachment failed for volume 68bb0540-d8f1-4cdf-962b-7f66e2523821. Error: Unable to create attachment for volume

and also here, which causes the live migration to halt:
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [-] [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] Pre live migration failed at compute-0.localdomain: RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)
...
/var/log/containers/nova/nova-compute.log:2019-01-17 12:52:22.169 1 ERROR nova.compute.manager [instance: 27aa9d02-d6f0-46e7-a160-563aed8e654c] RemoteError: Remote error: ClientException Unable to create attachment for volume. (HTTP 500) (Request-ID: req-d541d697-159f-4e4b-ab44-a7dd10780b92)

Cinder also confirms the error with the volume:
/var/log/containers/cinder/cinder-volume.log:2019-01-17 10:23:42.514 70 ERROR cinder.volume.manager Stderr: u' Failed to find logical volume "cinder-volumes/volume-68bb0540-d8f1-4cdf-962b-7f66e2523821"\n'

To test this with more jobs I've created an infrared change which tries to check and remove the volume from the workload instance before the OC_reboot procedure:
https://review.gerrithub.io/#/c/redhat-openstack/infrared/+/441197/3
You can reproduce this using: IR_GERRIT_CHANGE: 441197/3
This patch tries to do:
openstack server remove volume $vol_id $server_workload_id

Results: after the overcloud update, the instance volume is in Error state: http://pastebin.test.redhat.com/699078
This result reproduced again and again, and I believe it is the source of the problem here:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/71/artifact/.sh/ir-tripleo-overcloud-reboot.log
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-networking-ovn-update-14_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/70/artifact/.sh/ir-tripleo-overcloud-reboot.log

Seeing as the previous issue with rabbit/neutron post-reboot is no longer the prime suspect here (this is happening pre OC_reboot), I'm changing this bug's direction towards the errored volume after OC_upgrade.
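For checking the workload volume state before the reboot step, something like this should be enough (a sketch; note the openstack CLI expects the server first and then the volume for the remove call, and the name filter and variable names here are only illustrative):

  source ~/overcloudrc
  # the workload instance and its attached volume, as created by workload_launch.sh (name filter is an assumption)
  server_id=$(openstack server list --all-projects -f value -c ID --name workload | head -1)
  vol_id=$(openstack volume list --all-projects -f value -c ID | head -1)
  # check the volume status; "error" here is what blocks pre-live-migration
  openstack volume show $vol_id -c status -f value
  # detach it from the instance before the reboot procedure
  openstack server remove volume $server_id $vol_id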
Following further testing I'm resetting this bug to its original subject; the former may be a side effect. Currently creating automation to reproduce and test this $title scenario.
This bug is reproduced in all our automation jobs. The reboot procedure is done according to the official Red Hat documentation.
Hi Pini, all,
We saw the bug reproduced with a working, healthy volume and without any volume at all. This bug should not be blocked by or depend on BZ1668568.
Hi, any updates on this case? I'm hitting the same bug in RHOSP13, which is reported as BZ1592528, and would like to have a fix in RHOSP13 as well. I checked the attached URL, but I cannot find a clear reason why changing the said yaml would be a solution here, while the problem is happening in RabbitMQ. Thank you,
Guys, we really need to get a fix for this issue. It is affecting all update/upgrade jobs. I am adding an sos report of my setup with the same problem:
http://rhos-release.virt.bos.redhat.com/log/bz1661806