Bug 1441635
Summary: OSP10 -> OSP11 upgrade: nova instance live migration gets stuck with MIGRATING status before running compute node upgrade
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 11.0 (Ocata)
Target Release: 11.0 (Ocata)
Target Milestone: rc
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Keywords: Triaged
Type: Bug
Reporter: Marius Cornea <mcornea>
Assignee: Peter Lemenkov <plemenko>
QA Contact: Udi Shkalim <ushkalim>
Docs Contact:
CC: apevec, dbecker, fdinitto, jcoufal, jeckersb, lbopf, lhh, mabaakou, mandreou, markmc, mburns, mcornea, michele, morazi, panbalag, plemenko, rhel-osp-director-maint, rscarazz, sasha, sathlang, srevivo, vstinner
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-6.3.0-12.el7ost openstack-tripleo-heat-templates-6.0.0-9.el7ost
Doc Type: Release Note
Clones: 1447355 (view as bug list)
Bug Depends On: 1436784
Bug Blocks: 1447355
Last Closed: 2017-05-17 20:20:24 UTC
Description
Marius Cornea, 2017-04-12 11:40:23 UTC
o/ marius, I spent some time poking at the logs here. I can see rabbit-related errors on compute-r00-00 (the 'migrate_to' node) in /var/log/nova/nova-compute.log [0]. The main question I have is: was this compute node in a good state before you tried the migration? I think those rabbit errors coincide with rabbit going down during the controller upgrade. Does a service restart (nova-compute on the affected compute-0, for starters) fix this issue, and could it be related to BZ 1440680? That is still TBD.

[0]
2017-04-12 10:27:14.420 58749 INFO nova.service [-] Starting compute node (version 14.0.4-4.el7ost)
2017-04-12 10:27:14.454 58749 INFO nova.virt.libvirt.driver [-] Connection event '1' reason 'None'
2017-04-12 10:27:14.473 58749 INFO nova.virt.libvirt.host [req-9ccee0d6-80ff-44e2-af3d-fe3cae789a2c - - - - -] Libvirt host capabilities <capabilities>
2017-04-12 10:33:42.079 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] [6346950d-3061-40e3-8b55-3e2315a007f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds. Client port: None
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager raise exceptions.from_response(resp, method, url)
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager ServiceUnavailable: Service Unavailable (HTTP 503)
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [instance: dfa3192e-92b7-4175-a102-3b730a37331c] An error occurred while refreshing the network cache.
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [instance: dfa3192e-92b7-4175-a102-3b730a37331c] Traceback (most recent call last):
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [instance: dfa3192e-92b7-4175-a102-3b730a37331c] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5790, in _heal_instance_info_cache
2017-04-12 11:38:45.194 58749 ERROR root NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout
2017-04-12 11:32:25.177 58749 ERROR oslo.messaging._drivers.impl_rabbit [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [22097b85-0b02-41e7-8afe-41d37bd251f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 56946
2017-04-12 11:32:25.743 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout
2017-04-12 11:32:25.743 58749 ERROR root [-] Unexpected exception occurred 1 time(s)

(In reply to marios from comment #2)
> o/ marius spent some time poking at logs here. I can see rabbit related
> errors on compute-r00-00 (the 'migrate_to' node) in
> /var/log/nova/nova-compute.log [0]. The main question I have is was this
> compute node in a good state before you tried the migration? I think those
> rabbit errors coincide with rabbit going down on the controller upgrade.
> Does a service restart (nova-compute on the affected compute-0 for starters)
> fix this issue and so could it be related to BZ 1440680 but that is still
> TBD.
After running systemctl restart openstack-nova-compute.service on the compute nodes, both services go down:

| 78 | nova-compute | comp-r01-01.redhat.local | nova | enabled | down | 2017-04-12T15:25:26.000000 | - |
| 87 | nova-compute | comp-r00-00.redhat.local | nova | enabled | down | 2017-04-12T15:24:26.000000 | - |

I was able to recover by restarting rabbitmq on the controllers with 'pcs resource restart rabbitmq'. Note: at this point the controllers are already upgraded. Live migration also completed fine after restarting rabbitmq on the controllers, so it appears to me that the issue is with the Nova services reaching the RabbitMQ servers. I'm going to see if this error reproduces on the same environment with a fresh OSP10 deployment, to rule out whether this is an issue that only shows up during upgrade.

I confirmed that live migration works fine on the same environment after a fresh OSP10 deployment.

It looks like restarting rabbitmq on the controller nodes doesn't always fix this issue. I added it as a workaround in the jobs affected by this and I am still able to reproduce it.

mcornea, please set aside the next box this reproduces on so we can further investigate what is going on after the controller upgrade. I was surprised to see that a rabbit restart fixed this issue (although not always; is it both nova-compute and rabbit restart?). I went looking for any recent v6-related changes that might affect this (could only find https://review.openstack.org/#/c/450144/ but I'm pretty sure it's not related; it wouldn't be fixed by a rabbit restart if it were ipv6-rules related). If it turns out we need more/extra service restarts we can do that, but it's still not clear what the issue is.

Hi Marius,

regarding comment 8, could you check if adding the IPv4 entries in /etc/hosts of all servers before doing the migration solves this issue? That would narrow the problem down to ipv4 dns resolution.
Thanks,

(In reply to Sofer Athlan-Guyot from comment #10)
> Hi Marius,
>
> regarding comment 8, could you check if adding the IPv4 entries in the
> etc/hosts of all servers before doing the migration solves this issue ?
> That would narrow the problem to this ipv4 dns resolution.

Did you mean ipv6? I think it is not reproducible on v4, as per comment #0.

Hi Marios,

> did you mean ipv6? I think it is not reproducible in v4 as per comment #0

Well, the problem referenced in comment 8 is about rabbitmq resolving the hostname to an ipv4 address, thus bypassing the ipv6 resolution defined in /etc/hosts. So my comment was about testing whether this problem is similar to the one suggested by Fabio. But all in all, I don't think this is related to bz 1360398. Looking again at the log we can see:

sosreport-comp-r00-00.redhat.local-20170412114357/var/log/nova/nova-compute.log
2017-04-12 11:09:15.045 58749 ERROR oslo.messaging._drivers.impl_rabbit [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [22097b85-0b02-41e7-8afe-41d37bd251f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 56696
2017-04-12 11:09:15.559 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout

So I don't think DNS resolution (the hosts file) has anything to do with it, as the connection does take place.
In the rabbitmq log we have the matching error:

sosreport-ctrl-r02-02.redhat.local-20170412115605/var/log/rabbitmq/rabbit
=ERROR REPORT==== 12-Apr-2017::11:12:15 ===
closing AMQP connection <0.11205.0> ([FD00:FD00:FD00:2000::26]:56696 -> [FD00:FD00:FD00:2000::22]:5672 - nova-compute:58749:22097b85-0b02-41e7-8afe-41d37bd251f9):
missed heartbeats from client, timeout: 60s

It looks a bit like https://bugs.launchpad.net/oslo.messaging/+bug/1609766, whose fix is included in openstack/oslo.messaging 5.10.1, which should be present on the non-upgraded node:

sosreport-comp-r00-00.redhat.local-20170412114357/sos_commands/rpm/package-data
329:python-oslo-messaging-5.10.1-2.el7ost.noarch Fri Apr 7 20:53:20 2017

Asking Mehdi if it rings a bell anyway, as the problem looks similar.

I wonder if this is normal on ctrl00 in /var/log/rabbitmq/rabbit:

=ERROR REPORT==== 12-Apr-2017::11:07:08 ===
Mnesia('rabbit@ctrl-r00-00'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@ctrl-r02-02'}
=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'neutron-vo-SubPort-1.0' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'neutron-vo-QosPolicy-1.3' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'l3_agent.api-r01-01.redhat.local' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is ava
=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'cinder-volume_fanout_14322306aeea471fa93ce8c1482e2ae7' in vhost '/': Stopping all nodes on master shutdown since no sync

My guess is that oslo.messaging tries to declare a queue on node 0, but node 0 is a slave for this queue, so it asks the master to do it, and that TCP call (node 0 to the master node) times out.
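The stuck-queue theory above can be checked directly on a controller. A hedged sketch (the log path and node names are the ones from this environment; the queue info items are standard rabbitmqctl fields in RabbitMQ 3.6):

```shell
# List mirrored queues together with their master pid and mirror state.
# A queue whose synchronised_slave_pids list is empty has no in-sync mirror
# and, with the "stop all nodes on master shutdown" behaviour logged above,
# will stop rather than fail over when its master node goes down.
rabbitmqctl list_queues name pid slave_pids synchronised_slave_pids

# Look for the Mnesia partition events quoted above on each controller.
grep -E 'inconsistent_database|partitioned_network' /var/log/rabbitmq/rabbit@*.log
```

These commands only make sense on a controller running the affected RabbitMQ cluster; they are diagnostic, not a fix.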
Also, rabbit is generating many crash reports:

# grep -c CRASH rabbit\@ctrl-r00-00-sasl.log
271

"It looks a bit like https://bugs.launchpad.net/oslo.messaging/+bug/1609766 which is included in version openstack/oslo.messaging 5.10.1, which should be included on the non-upgraded node: sosreport-comp-r00-00.redhat.local-20170412114357/sos_commands/rpm/package-data 329:python-oslo-messaging-5.10.1-2.el7ost.noarch Fri Apr 7 20:53:20 2017"

This bug was fixed by the commit "Fix consuming from unbound reply queue": https://review.openstack.org/#/c/365959/

This commit was backported to OSP 10 and was part of the first OSP 11 release:

* OSP 10: since python-oslo-messaging 5.10.0-3 (Nov 14 2016)
* OSP 11: since the first package, python-oslo-messaging 5.17.0-1 (Feb 08 2017)

python-oslo-messaging-5.10.1-2 comes from OSP 10 and is more recent than 5.10.0-3, so it already includes the "Fix consuming from unbound reply queue" fix. In the sosreports, I only see versions with the fix:

haypo@selma$ grep messaging */installed-rpms |cut -f2 -d:|sort -u
python-oslo-messaging-5.10.1-2.el7ost.noarch Fri Apr 7 20
python-oslo-messaging-5.17.1-2.el7ost.noarch Wed Apr 12 10

Hello All! Status report from the RabbitMQ side.

* From the logs posted in comment 1, I really don't see anything related to IPv6. Quite the contrary: I can see that RabbitMQ actually handles IPv6 connections properly.
* I see GH#1035 (https://github.com/rabbitmq/rabbitmq-server/issues/1035). It should be mostly harmless. See also bug #1436784, where some other failure led to the same issue.
* What really surprised me is that this is the second time I have witnessed the return of GH#530 and GH#544 (the first was in bug #1441685). These two issues were believed to be fixed in version 3.6.1, and we're using 3.6.3 and 3.6.5 (after the upgrade), so we shouldn't see them. And yet here they are. So far I don't have any serious-looking theories which could explain this.
These two are known fatal issues, which can render a RabbitMQ cluster unreachable.

Created attachment 1272686 [details]
RMQ chain of events
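While the root cause was still being chased, the recovery sequence reported earlier in this bug can be summarised as a sketch (the resource and service names are the ones used in this deployment; per later comments, the rabbitmq restart alone was not always sufficient, and in one case only the nova-compute restart unstuck the instance):

```shell
# On one controller: restart the rabbitmq pacemaker resource cluster-wide.
pcs resource restart rabbitmq

# On each affected compute node: restart nova-compute so it re-establishes
# its AMQP connections and redeclares its queues.
systemctl restart openstack-nova-compute.service

# From a host with the overcloud credentials loaded: confirm the services
# report "up" again before retrying the live migration.
nova service-list --binary nova-compute
```

This is a stop-gap workaround for an already-stuck cluster, not a substitute for the ha-mode policy fix tracked by this bug.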
Peter,

Just wanted to note that my setup is IPv4 and I was able to reproduce it again. But I still see errors in /var/log/rabbitmq/rabbit@controller-*.log similar to the ones here:

https://bugzilla.redhat.com/attachment.cgi?id=1267032 (attachment for bug https://bugzilla.redhat.com/show_bug.cgi?id=1436784)

(In reply to Prasanth Anbalagan from comment #23)
> Just wanted to note that my setup is IPv4 and I was able to reproduce it
> again. But I still see errors in /var/log/rabbitmq/rabbit@controller-*.log
> similar to here,
> https://bugzilla.redhat.com/attachment.cgi?id=1267032

Please try the 3.6.5-2 build. It should fix this particular issue.

We're moving this to MODIFY as we have a new package, but we are aware that it may bounce back to on_dev.

Peter,

Looks like the cause is not resize-confirm. After an upgrade or update operation followed by rebooting the nodes, rabbitmq on certain controller nodes shuts down, causing some compute nodes to lose connectivity. Upon restarting the rabbitmq service, everything works fine.

(In reply to Sofer Athlan-Guyot from comment #27)
> We're moving this to MODIFY as we have a new package, but we are aware that
> it may bounce back to on_dev.

I'm afraid it's too early. This build should fix bug #1436784, and it certainly cannot fully address this one.

Marius,

One thing to try is to comment out the upgrade task which modifies the RabbitMQ policy:

https://github.com/openstack/tripleo-heat-templates/blob/stable/ocata/puppet/services/pacemaker/rabbitmq.yaml#L69

You should be able to just comment out the `pcs resource update ...` line.

In summary, here's what we see happening:

- Something at the end of the upgrade seems to cause a brief (5-6 second) loss of network connectivity between the controllers. This can be seen in both the galera and rabbitmq logs at the same time.
- The rabbitmq nodes detect the partition and begin stopping the application due to pause_minority.
- The partition ends almost immediately, and queues begin to fail over while the application is still in the process of shutting down.
- Some queues do not fail over correctly because the slave (only one slave, because the ha policy has already been updated!) is out of sync. The queue master stops.
- The node finishes shutting down.
- Eventually some combination of the partition-handling code and the resource agent gets the cluster back up. HOWEVER! When the cluster is back up, the queue master process(es) are still not running. Why exactly, I am not sure. This is probably a bug due to some race condition.
- Any operation on the queue from now on results in a timeout waiting on the non-existent queue master.
- Migrations fail because the necessary queues are stuck.

(In reply to John Eckersberg from comment #39)

Thanks John. I tested by removing the 'pcs resource update rabbitmq set_policy' line and I got better results: 3 of 4 attempts succeeded. The failing one was a bit different from the initial report, as the instance no longer got stuck in MIGRATING but eventually returned to ACTIVE state without being migrated off the host. These kinds of messages can be observed in the nova compute logs: http://paste.openstack.org/show/607855/

Brief update on where things stand:

- We're going to revert back to using "ha-mode all" as the rabbitmq policy. This does not avoid the problem 100%, but in theory and in observation it certainly helps quite a bit.
- There is still evidence of something odd going on with the network. This will require more investigation, probably involving extensive packet captures and analysis.
- Possibly related to the network issue, galera is exhibiting odd behavior where an extra IP address appears and then disappears as a member of the galera cluster. This address is allocated as a virtual IP in the deployment and should not be part of the galera cluster. We're looking into this more.

OK, we've figured this out. The problem is that the VIPs are on the same network as the "normal" controller interfaces, which causes problems with routing when a VIP moves.

Consider controller ctrl-r00-00. It has IPv6 address fd00:fd00:fd00:2000::20/64 on vlan200, which is the internalapi network. The internalapi VIP has IPv6 address fd00:fd00:fd00:2000::5, also on vlan200, the internalapi network.
Now, when the VIP resource is active on ctrl-r00-00, the vlan200 interface has both addresses. There is also a route to fd00:fd00:fd00:2000::/64 on dev vlan200.

What happens when galera starts and replication connects to the other cluster members? The configuration item used is:

wsrep_cluster_address=gcomm://ctrl-r00-00,ctrl-r01-01,ctrl-r02-02

(this is an attribute on the galera resource)

Next, ctrl-r00-00 does a name lookup for ctrl-r01-01, which is:

[root@ctrl-r00-00 ~]# getent hosts ctrl-r01-01
fd00:fd00:fd00:2000::21 ctrl-r01-01.redhat.local ctrl-r01-01

So we need to make a connection to fd00:fd00:fd00:2000::21. Remember, we have a route for that network on dev vlan200, but we have two valid IPv6 addresses on that network. In short, the kernel chooses the VIP address as the source address for the connection. And that's a problem: when the VIP moves to a different host, any outbound packets become unroutable.

So in this circumstance, ctrl-r01-01 and ctrl-r02-02 see the galera membership of ctrl-r00-00 as originating from the VIP, not from the address normally associated with ctrl-r00-00. When the VIP moves, galera logs that a member with the VIP address went away. This is the suspicious behavior we observed previously.

The same thing happens with RabbitMQ. Some of the inter-cluster connections use the VIP as a source address. When the VIP moves, RabbitMQ sees other cluster members disappear, and things generally fall apart from there.

Hi, just adding that this is an ipv6-only problem, because IPv4 VIPs are /32 and thus not the preferred source address when connecting to the other nodes.

We tried adding the VIP as a /128 and it got refused by the kernel.

(In reply to Sofer Athlan-Guyot from comment #45)
> We tried adding the VIP as a /128 and it got refused by the kernel.

Can you please provide an error message? I can add a VIP with /128 without any issue.
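The source-address selection described above can be observed with iproute2. A sketch using this environment's addresses (fd00:fd00:fd00:2000::5 is the internalapi VIP; the exact output depends on the kernel's address-selection rules):

```shell
# Ask the kernel which source address it would use to reach a peer
# controller; with the VIP configured as a /64 on vlan200, the VIP can
# be picked over the node's own address.
ip -6 route get fd00:fd00:fd00:2000::21

# The prefix route installed alongside the VIP is what makes it a source
# candidate; adding the address with "noprefixroute" (mentioned in the
# comments above) suppresses that route.
ip -6 addr add fd00:fd00:fd00:2000::5/64 dev vlan200 noprefixroute
```

These commands need root on a controller with the vlan200 interface present; they illustrate the diagnosis, not the eventual fix.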
Hi,

(In reply to Fabio Massimo Di Nitto from comment #46)
> Can you please provide an error message? I can add VIP with 128 without any
> issue.

So the /128 error was due to the fact that the IPaddr2 resource agent cannot find the interface when we use ipv6 (while it can when we use ipv4); the problem is described at https://bugzilla.redhat.com/show_bug.cgi?id=1445628

To be more precise, here is what is not working:

pcs resource update ip-fd00.fd00.fd00.2000..5 ip=fd00:fd00:fd00:2000::5 cidr_netmask=128

* ip-fd00.fd00.fd00.2000..5_start_0 on ctrl-r01-01 'unknown error' (1): call=166, status=complete, exitreason='Unable to find nic, or netmask mismatch.',

(In reply to Sofer Athlan-Guyot from comment #48)
> pcs resource update ip-fd00.fd00.fd00.2000..5 ip=fd00:fd00:fd00:2000::5
> cidr_netmask=128
>
> * ip-fd00.fd00.fd00.2000..5_start_0 on ctrl-r01-01 'unknown error' (1):
> call=166, status=complete, exitreason='Unable to find nic, or netmask
> mismatch.',

OK, so this is not the kernel rejecting the route, but the resource agent, because it doesn't have enough information to determine where to install the VIP.

Current status:

a) As a recap: we have identified the "network outage" as being due to rabbitmq/galera binding the source side of their sockets to a VIP. Any subsequent move of the VIP results in the rabbit/galera node being disconnected from its cluster for a short period of time. Note: all services are potentially affected and will have one or more hung TCP connections.

When adding the IPv6 VIP, a new entry gets created in the routing table. This has the longest prefix and so is chosen. Using "noprefixroute" solves this when we add the address manually. With IPv4, no entry is created in the routing table by default, so the VIP is never chosen as a source address when a rabbit/galera node connects to one of its peers.

b) We have two reviews up to revert to the "ha-mode all" rabbitmq policy, which should make rabbitmq more robust against this failure:

1. https://review.openstack.org/#/c/459994/ (puppet-tripleo)
2. https://review.openstack.org/#/c/459998/ (tripleo-heat-templates)

c) We'd like another round of testing with the above patches included, so that we can be reasonably certain that rabbitmq is more resilient in the face of a short network outage. (We want this because it might very well be that OSP10 did not trigger an haproxy restart and VIP relocation.) Marius, can you take care of this and report back?

If we can reasonably confirm that the reviews in b) no longer exhibit the issue, we propose to pull them downstream even before they merge upstream (upstream needs to land master first and then ocata, which depending on CI weather might take a few days). In the meantime we will keep working on a solution for the ipv6 VIP route/netmask issue.

(In reply to Michele Baldessari from comment #50)
> c) We'd like another round of testing with the above patches included so
> that we can be reasonably certain that rabbitmq is more resilient in the
> face of a short network outage. (We want this because it might very well be
> that OSP10 did not trigger an haproxy restart and VIP relocation). Marius
> can you take care of this and report back?
Yes, I'm currently running the upgrades with the 2 patches applied. I'll get back with the results once they finish.

Marius, you can use:

https://review.openstack.org/#/c/460204/ (puppet-tripleo)
https://review.openstack.org/#/c/460202/ (tht)

Those are the ocata backports already (since master for tht deviated from downstream a bit and the patch did not apply cleanly).

Let's keep this bug for tracking the rabbitmq ha-mode changes. Let's use https://bugzilla.redhat.com/show_bug.cgi?id=1445905 to track the ipv6 VIP issue.

(In reply to Michele Baldessari from comment #52)
> Marius, you can use:
> https://review.openstack.org/#/c/460204/ (puppet-tripleo)
> https://review.openstack.org/#/c/460202/ (tht)
>
> those are the ocata backports already (since master for tht deviated from
> downstream a bit and the patch did not apply cleanly)

I tested on environments including these patches but didn't get good results: live migration failed or got stuck on 3 of 5 environments.

Reproduced the issue. Environment:

openstack-puppet-modules-10.0.0-1.el7ost.noarch
instack-undercloud-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-7.el7ost.noarch
openstack-nova-cert-14.0.4-4.el7ost.noarch
openstack-nova-scheduler-14.0.4-4.el7ost.noarch
puppet-nova-9.5.0-3.el7ost.noarch
python-nova-14.0.4-4.el7ost.noarch
python-novaclient-6.0.0-1.el7ost.noarch
openstack-nova-compute-14.0.4-4.el7ost.noarch
openstack-nova-novncproxy-14.0.4-4.el7ost.noarch
openstack-nova-console-14.0.4-4.el7ost.noarch
openstack-nova-common-14.0.4-4.el7ost.noarch
openstack-nova-api-14.0.4-4.el7ost.noarch
openstack-nova-conductor-14.0.4-4.el7ost.noarch

[stack@director ~]$ nova list --all
| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
| d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 | cirros_test | 95eb1ca2fd094344a41cddc73e46dbca | ACTIVE | - | Running | tenant_net=192.168.201.10, 192.168.191.7 |

[stack@director ~]$ nova live-migration d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 overcloud-compute-2
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. <class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID: req-3cbbd7d7-98d0-4b9f-9956-f62b453a3377)

[stack@director ~]$ nova show d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9|grep hyper
| OS-EXT-SRV-ATTR:hypervisor_hostname | overcloud-compute-0.cwdg720xd01.org |

[stack@director ~]$ nova live-migration d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 overcloud-compute-2.cwdg720xd01.org
ERROR (Conflict): Cannot 'os-migrateLive' instance d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 while it is in task_state migrating (HTTP 409) (Request-ID: req-a954dcdc-4587-4b4b-bd5f-c1d9513cfc4d)

| ID | Name | Tenant ID | Status | Task State | Power State | Networks |
| d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 | cirros_test | 95eb1ca2fd094344a41cddc73e46dbca | MIGRATING | migrating | Running | tenant_net=192.168.201.10, 192.168.191.7 |

Restarted rabbitmq on the controllers: didn't help. Restarted openstack-nova-compute.service on the compute node from which I tried to migrate the instance: the instance returned to ACTIVE. Launched the nova live-migration command again, and it succeeded.

(In reply to Alexander Chuzhoy from comment #55)
> Restarted rabbitmq on controllers - didn't help.
> restarted openstack-nova-compute.service on the compute from which I tried
> to migrate the instance - the instance returned to be active.
> Launched the nova live-migration command again - and it succeeded.

So there is a workaround that we can add to the Release Notes for this specific situation.

Some details re the setup in comment #55: this was during an upgrade from OSP10 to OSP11 on a setup with IPv4.

(In reply to Alexander Chuzhoy from comment #58)
> Some details re setup in comment #55:
> This was during Upgrade from OSP10 to OSP11 on a setup with IPv4.

So I took a very quick peek at an env that had the issue (even though it seems to be currently in an odd state: I can't reach controller-0 and controller-1 from the undercloud). I would say that this problem is rather different from what is being tracked in this bug:

[root@overcloud-controller-2 ~]# grep -C2 partition /var/log/rabbitmq/rabbit\@overcloud-controller-2.log
[root@overcloud-controller-2 ~]# grep -C2 partial /var/log/rabbitmq/rabbit\@overcloud-controller-2.log

There are no signs of partitions like we observed with ipv6, so I'd say it is best to open a new bug for what you are observing. Also, yum updated the packages at around Apr 26 20:20:27, and the only etimedout rabbitmq logs are around:

=ERROR REPORT==== 26-Apr-2017::00:02:48 ===
closing AMQP connection <0.3086.0> (192.168.140.21:34974 -> 192.168.140.22:5672):
{inet_error,etimedout}

So 19 hours before.
So if you can reproduce this, it is best that we track the problem in a separate bug.

Setup: 3 controllers, 2 computes, 3 ceph nodes; IPv6; OSP10 upgraded to OSP11.

Before upgrading the compute nodes I ran host-evacuate-live twice, and the instance was ACTIVE and evacuated successfully.

Packages:
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch
puppet-tripleo-6.3.0-12.el7ost.noarch

Policy changed:
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
 Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
 Meta Attrs: notify=true
 Operations: monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
             start interval=0s timeout=200s (rabbitmq-start-interval-0s)
             stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245