Bug 1441635

Summary: OSP10 -> OSP11 upgrade: nova instance live migration gets stuck with MIGRATING status before running compute node upgrade
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Udi Shkalim <ushkalim>
Severity: urgent
Priority: urgent
Version: 11.0 (Ocata)
CC: apevec, dbecker, fdinitto, jcoufal, jeckersb, lbopf, lhh, mabaakou, mandreou, markmc, mburns, mcornea, michele, morazi, panbalag, plemenko, rhel-osp-director-maint, rscarazz, sasha, sathlang, srevivo, vstinner
Target Milestone: rc
Keywords: Triaged
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-6.3.0-12.el7ost, openstack-tripleo-heat-templates-6.0.0-9.el7ost
Doc Type: Release Note
Clones: 1447355 (view as bug list)
Last Closed: 2017-05-17 20:20:24 UTC
Type: Bug
Bug Depends On: 1436784    
Bug Blocks: 1447355    
Attachments:
RMQ chain of events

Description Marius Cornea 2017-04-12 11:40:23 UTC
Description of problem:

During the OSP10 -> OSP11 upgrade process, nova instance live migration gets stuck in MIGRATING status before running the compute node upgrade. This happens on an environment with 3 controllers, 2 compute nodes, 3 custom nodes running systemd-managed services, 1 swift object node and 3 ceph nodes. The environment uses IPv6 for the isolated networks. Note that I am not able to reproduce this issue on IPv4 networking.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-6.0.0-3.el7ost.noarch

How reproducible:
only on IPv6 environments

Steps to Reproduce:
1. Deploy OSP10 with IPv6 isolated networks
2. Run OSP11 upgrade major-upgrade-composable-steps.yaml step
3. Before proceeding to upgrade the compute node, live migrate all the instances from it with 'nova host-evacuate-live comp-r01-01.redhat.local'
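
The migration status shown under "Actual results" below can be gathered with the standard nova CLI; a minimal check sequence (using the host name from this environment) might look like:

    nova host-evacuate-live comp-r01-01.redhat.local
    nova migration-list
    nova list --all-tenants --host comp-r01-01.redhat.local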

Actual results:
+--------------------------------------+-------------------------+---------------+
| Server UUID                          | Live Migration Accepted | Error Message |
+--------------------------------------+-------------------------+---------------+
| ae83be51-cb9d-4f5b-86ba-75ed1b102c39 | True                    |               |
+--------------------------------------+-------------------------+---------------+

+----+-------------+-----------+--------------------------+--------------+-----------+----------+--------------------------------------+------------+------------+----------------------------+------------+----------------+
| Id | Source Node | Dest Node | Source Compute           | Dest Compute | Dest Host | Status   | Instance UUID                        | Old Flavor | New Flavor | Created At                 | Updated At | Type           |
+----+-------------+-----------+--------------------------+--------------+-----------+----------+--------------------------------------+------------+------------+----------------------------+------------+----------------+
| 3  | -           | -         | comp-r01-01.redhat.local | -            | -         | accepted | ae83be51-cb9d-4f5b-86ba-75ed1b102c39 | 3          | 3          | 2017-04-12T11:10:33.000000 | -          | live-migration |
+----+-------------+-----------+--------------------------+--------------+-----------+----------+--------------------------------------+------------+------------+----------------------------+------------+----------------+

| ae83be51-cb9d-4f5b-86ba-75ed1b102c39 | st-provinstance-fvfbxtffm5gy-my_instance-yulueooc7lls | MIGRATING | provider01=172.16.19.13                                                                             | Fedora     |


Expected results:
The instance gets migrated from comp-r01-01.redhat.local node.

Additional info:
Attaching sosreports.

Comment 2 Marios Andreou 2017-04-12 14:33:46 UTC
o/ Marius, I spent some time poking at the logs here. I can see rabbit-related errors on compute-r00-00 (the 'migrate_to' node) in /var/log/nova/nova-compute.log [0]. The main question I have is: was this compute node in a good state before you tried the migration? I think those rabbit errors coincide with rabbit going down on the controller upgrade. Does a service restart (nova-compute on the affected compute-0, for starters) fix this issue, and could it therefore be related to BZ 1440680? That is still TBD.

[0]
2017-04-12 10:27:14.420 58749 INFO nova.service [-] Starting compute node (version 14.0.4-4.el7ost)
2017-04-12 10:27:14.454 58749 INFO nova.virt.libvirt.driver [-] Connection event '1' reason 'None'
2017-04-12 10:27:14.473 58749 INFO nova.virt.libvirt.host [req-9ccee0d6-80ff-44e2-af3d-fe3cae789a2c - - - - -] Libvirt host capabilities <capabilities>
2017-04-12 10:33:42.079 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] [6346950d-3061-40e3-8b55-3e2315a007f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'. Trying again in 1 seconds. Client port: None
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager     raise exceptions.from_response(resp, method, url)
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager ServiceUnavailable: Service Unavailable (HTTP 503)
2017-04-12 10:36:21.334 58749 ERROR nova.compute.manager 
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [instance: dfa3192e-92b7-4175-a102-3b730a37331c] An error occurred while refreshing the network cache.
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [instance: dfa3192e-92b7-4175-a102-3b730a37331c] Traceback (most recent call last):
2017-04-12 10:36:53.258 58749 ERROR nova.compute.manager [instance: dfa3192e-92b7-4175-a102-3b730a37331c]   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 5790, in _heal_instance_info_cache

2017-04-12 11:38:45.194 58749 ERROR root NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout

2017-04-12 11:32:25.177 58749 ERROR oslo.messaging._drivers.impl_rabbit [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [22097b85-0b02-41e7-8afe-41d37bd251f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 56946
2017-04-12 11:32:25.743 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout
2017-04-12 11:32:25.743 58749 ERROR root [-] Unexpected exception occurred 1 time(s)

Comment 3 Marius Cornea 2017-04-12 15:33:36 UTC
(In reply to marios from comment #2)
> o/ marius spent some time poking at logs here. I can see rabbit related
> errors on compute-r00-00 (the 'migrate_to' node) in
> /var/log/nova/nova-compute.log [0]. The main question I have is was this
> compute node in a good state before you tried the migration? I think those
> rabbit errors coincide with rabbit going down on the controller upgrade.
> Does a service restart (nova-compute on the affected compute-0 for starters)
> fix this issue and so could it be related to BZ 1440680 but that is still
> TBD.
> 

After running systemctl restart openstack-nova-compute.service on the compute nodes both services go down:

| 78 | nova-compute     | comp-r01-01.redhat.local | nova     | enabled | down  | 2017-04-12T15:25:26.000000 | -               |
| 87 | nova-compute     | comp-r00-00.redhat.local | nova     | enabled | down  | 2017-04-12T15:24:26.000000 | -               |


I was able to recover by restarting rabbitmq on the controllers with 'pcs resource restart rabbitmq'. Note: at this point the controllers are already upgraded.

Live migration also completed fine after restarting rabbitmq on the controllers, so it appears to me that the issue is with Nova services reaching the RabbitMQ servers.

I'm going to see if this error reproduces on the same environment with a fresh OSP10 deployment, to rule out whether this is an issue showing up only during upgrade.

Comment 4 Marius Cornea 2017-04-12 16:36:53 UTC
I confirmed live migration works fine on the same environment after a fresh OSP10 deployment.

Comment 5 Marius Cornea 2017-04-13 11:17:44 UTC
It looks like restarting rabbitmq on the controller nodes doesn't always fix this issue. I added it as a workaround in the jobs affected by this and I am still able to reproduce it.

Comment 6 Marios Andreou 2017-04-13 12:25:02 UTC
mcornea, on the next box this reproduces on, can you please set it aside so we can further investigate what is going on after the controller upgrade? I was surprised to see that a rabbit restart fixed this issue (although not always; is it both nova-compute and rabbit restart?). I went looking for any recent v6-related changes that might affect this (could only find https://review.openstack.org/#/c/450144/ but I'm pretty sure it's not related; it wouldn't be fixed by a rabbit restart if it were related to ipv6 rules).

If it turns out we need more/extra service restarts we can do that, but it's still not clear what the issue is.

Comment 10 Sofer Athlan-Guyot 2017-04-18 13:57:20 UTC
Hi Marius,

regarding comment 8, could you check if adding the IPv4 entries in the /etc/hosts of all servers before doing the migration solves this issue? That would narrow the problem down to this IPv4 DNS resolution.
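
A minimal sketch of what such entries could look like (the IPv4 addresses below are hypothetical placeholders; the real ones would come from the deployment):

    172.17.1.20 ctrl-r00-00.redhat.local ctrl-r00-00
    172.17.1.21 ctrl-r01-01.redhat.local ctrl-r01-01
    172.17.1.22 ctrl-r02-02.redhat.local ctrl-r02-02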

Thanks,

Comment 11 Marios Andreou 2017-04-18 14:03:45 UTC
(In reply to Sofer Athlan-Guyot from comment #10)
> Hi Marius,
> 
> regarding comment 8, could you check if adding the IPv4 entries in the
> etc/hosts of all servers before doing the migration solves this issue ? 
> That would narrow the problem to this ipv4 dns resolution.
> 
> Thanks,

did you mean ipv6? I think it is not reproducible in v4 as per comment #0

Comment 12 Sofer Athlan-Guyot 2017-04-18 20:42:08 UTC
Hi Marios,

> did you mean ipv6? I think it is not reproducible in v4 as per comment #0

well, the referenced problem in comment 8 is about rabbitmq resolving the hostname to an IPv4 address, and thus bypassing the IPv6 resolution defined in /etc/hosts. So my comment was about testing whether this problem is similar to the one suggested by Fabio.

But all in all I don't think this is related to bz 1360398.

Looking again at the log we can see:

sosreport-comp-r00-00.redhat.local-20170412114357/var/log/nova/nova-compute.log

2017-04-12 11:09:15.045 58749 ERROR oslo.messaging._drivers.impl_rabbit [req-fdb8332f-ea5c-45e4-ad9d-456bd31518b2 - - - - -] [22097b85-0b02-41e7-8afe-41d37bd251f9] AMQP server on fd00:fd00:fd00:2000::22:5672 is unreachable: timed out. Trying again in 1 seconds. Client port: 56696

2017-04-12 11:09:15.559 58749 ERROR oslo.messaging._drivers.impl_rabbit [-] Failed to consume message from queue: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'compute.comp-r00-00.redhat.local' in vhost '/' due to timeout

So I don't think that DNS resolution (the hosts file) has anything to do with it, as the connection does take place.

In the rabbitmq log we have the matching error:

sosreport-ctrl-r02-02.redhat.local-20170412115605/var/log/rabbitmq/rabbit

=ERROR REPORT==== 12-Apr-2017::11:12:15 ===
closing AMQP connection <0.11205.0> ([FD00:FD00:FD00:2000::26]:56696 -> [FD00:FD00:FD00:2000::22]:5672 - nova-compute:58749:22097b85-0b02-41e7-8afe-41d37bd251f9):
missed heartbeats from client, timeout: 60s

It looks a bit like https://bugs.launchpad.net/oslo.messaging/+bug/1609766, whose fix is included in openstack/oslo.messaging 5.10.1, which should be present on the non-upgraded node:

sosreport-comp-r00-00.redhat.local-20170412114357/sos_commands/rpm/package-data
329:python-oslo-messaging-5.10.1-2.el7ost.noarch        Fri Apr  7 20:53:20 2017

Asking Mehdi if it rings a bell anyway as the problem looks similar.

Comment 14 Mehdi ABAAKOUK 2017-04-19 09:51:21 UTC
I wonder if this is normal on ctrl-r00-00 in /var/log/rabbitmq/rabbit?

=ERROR REPORT==== 12-Apr-2017::11:07:08 ===
Mnesia('rabbit@ctrl-r00-00'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@ctrl-r02-02'}

=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'neutron-vo-SubPort-1.0' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'neutron-vo-QosPolicy-1.3' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available

=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'l3_agent.api-r01-01.redhat.local' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is ava

=WARNING REPORT==== 12-Apr-2017::11:07:08 ===
Mirrored queue 'cinder-volume_fanout_14322306aeea471fa93ce8c1482e2ae7' in vhost '/': Stopping all nodes on master shutdown since no sync

Comment 15 Mehdi ABAAKOUK 2017-04-19 10:30:04 UTC
My guess is that oslo.messaging tries to declare a queue on node 0, but node 0 is a slave for this queue, so it asks the master to do it, and that TCP call (node 0 to master node) times out.

Also, rabbit is generating many crash reports:

# grep -c CRASH rabbit\@ctrl-r00-00-sasl.log
271

Comment 16 Victor Stinner 2017-04-19 11:46:19 UTC
"It looks a bit like https://bugs.launchpad.net/oslo.messaging/+bug/1609766 which is included in version openstack/oslo.messaging 5.10.1, which should be included on the non-upgraded node:

sosreport-comp-r00-00.redhat.local-20170412114357/sos_commands/rpm/package-data
329:python-oslo-messaging-5.10.1-2.el7ost.noarch        Fri Apr  7 20:53:20 2017"

This bug was fixed by the commit "Fix consuming from unbound reply queue":
https://review.openstack.org/#/c/365959/

This commit was backported to OSP 10 and was part of the first OSP 11 release:

* OSP 10: since python-oslo-messaging 5.10.0-3 (Nov 14 2016)
* OSP 11: since the first package, python-oslo-messaging 5.17.0-1 (Feb 08 2017)

python-oslo-messaging-5.10.1-2 comes from OSP 10 and is more recent than 5.10.0-3, so it already includes the "Fix consuming from unbound reply queue" fix.

In the sosreports, I only see versions with the fix:

haypo@selma$ grep messaging */installed-rpms |cut -f2 -d:|sort -u
python-oslo-messaging-5.10.1-2.el7ost.noarch                Fri Apr  7 20
python-oslo-messaging-5.17.1-2.el7ost.noarch                Wed Apr 12 10

Comment 18 Peter Lemenkov 2017-04-19 16:40:57 UTC
Hello All!
Status report from RabbitMQ side.

* From the logs posted in comment 1 I really don't see anything related to IPv6. Quite the contrary - I can see that RabbitMQ actually handles IPv6 connections properly.

* I see GH#1035 (https://github.com/rabbitmq/rabbitmq-server/issues/1035). It should be mostly harmless. See also bug #1436784 where some other failure led to the same issue.

* What really surprised me is that this is the second time I've witnessed the return of GH#530 and GH#544 (the first one was in bug #1441685). These two issues were believed to be fixed in version 3.6.1, and we're using 3.6.3 and 3.6.5 (after upgrade), so we shouldn't see them. And yet here they are. So far I don't have any serious-looking theories which could explain this. These two are known fatal issues, which can render the RabbitMQ cluster unreachable.

Comment 20 Peter Lemenkov 2017-04-19 17:20:35 UTC
Created attachment 1272686 [details]
RMQ chain of events

Comment 23 Prasanth Anbalagan 2017-04-21 01:02:22 UTC
Peter,

Just wanted to note that my setup is IPv4 and I was able to reproduce it again. But I still see errors in /var/log/rabbitmq/rabbit@controller-*.log similar to here,

https://bugzilla.redhat.com/attachment.cgi?id=1267032 
 -- (attachment for bug https://bugzilla.redhat.com/show_bug.cgi?id=1436784)

Comment 24 Peter Lemenkov 2017-04-21 07:36:32 UTC
(In reply to Prasanth Anbalagan from comment #23)
> Peter,
> 
> Just wanted to note that my setup is IPv4 and I was able to reproduce it
> again. But I still see errors in /var/log/rabbitmq/rabbit@controller-*.log
> similar to here,
> 
> https://bugzilla.redhat.com/attachment.cgi?id=1267032 
>  -- (attachment for bug https://bugzilla.redhat.com/show_bug.cgi?id=1436784)

Please try 3.6.5-2 build. It should fix this particular issue.

Comment 27 Sofer Athlan-Guyot 2017-04-21 13:04:46 UTC
We're moving this to MODIFY as we have a new package, but we are aware that it may bounce back to on_dev.

Comment 29 Prasanth Anbalagan 2017-04-21 13:18:57 UTC
Peter,

Looks like the cause is not resize-confirm. After an upgrade or update operation followed by rebooting the nodes, rabbitmq on certain controller nodes shuts down, causing some compute nodes to lose connectivity. Upon restarting the rabbitmq service, everything works fine.

Comment 30 Peter Lemenkov 2017-04-21 14:07:07 UTC
(In reply to Sofer Athlan-Guyot from comment #27)
> We're moving this to MODIFY as we have a new package, but we are aware that
> it may bounce back to on_dev.

I'm afraid it's too early. This build should fix bug #1436784, and certainly cannot address this one fully.

Comment 39 John Eckersberg 2017-04-25 01:23:47 UTC
Marius,

One thing to try is to comment out the upgrade task which modifies the RabbitMQ policy:

https://github.com/openstack/tripleo-heat-templates/blob/stable/ocata/puppet/services/pacemaker/rabbitmq.yaml#L69

You should be able to just comment out the `pcs resource update ...` line.
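
For reference, the upgrade task in question updates the set_policy attribute of the pacemaker-managed rabbitmq resource; it runs something along the lines of the sketch below (the exact policy parameters are an assumption based on the "only one slave" behaviour described in the summary, and may differ in the actual template):

    pcs resource update rabbitmq set_policy='ha-all ^(?!amq\.).* {"ha-mode":"exactly","ha-params":2}'

Commenting it out leaves the original "ha-mode":"all" policy in place during the upgrade.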

In summary, here's what we see happening:

- Something at the end of the upgrade seems to cause a brief (5-6 seconds) loss of network connectivity between the controllers.  This can be seen in both the galera and rabbitmq logs at the same time

- Rabbitmq nodes detect partition and begin stopping the application due to pause_minority

- Partition ends almost immediately, and queues begin to fail over while the application is still in the process of shutting down

- Some queues do not fail over correctly because the slave (only one slave because the ha policy is updated already!!!) is out of sync.  The queue master stops.

- The node finishes shutting down.

- Eventually some combination of the partition handling code and the resource agent gets the cluster back up.  HOWEVER!  When the cluster is back up, the queue master process(es) are still not running.  Why exactly, I am not sure.  This is probably a bug due to some race condition.

- Any operation on the queue from now on will result in a timeout waiting on the non-existent queue master

- Migrations fail because necessary queues are stuck.

Comment 40 Marius Cornea 2017-04-25 12:19:16 UTC
(In reply to John Eckersberg from comment #39)
> Marius,
> 
> One thing to try is to comment out the upgrade task which modifies the
> RabbitMQ policy:
> 
> https://github.com/openstack/tripleo-heat-templates/blob/stable/ocata/puppet/
> services/pacemaker/rabbitmq.yaml#L69
> 
> You should be able to just comment out the `pcs resource update ...` line.
> 
> In summary, here's what we see happening:
> 
> - Something at the end of the upgrade seems to cause a brief (5-6 seconds)
> loss of network connectivity between the controllers.  This can be seen in
> both the galera and rabbitmq logs at the same time
> 
> - Rabbitmq nodes detect partition and begin stopping the application due to
> pause_minority
> 
> - Partition ends almost immediately, and queues begin to fail over while the
> application is still in the process of shutting down
> 
> - Some queues do not fail over correctly because the slave (only one slave
> because the ha policy is updated already!!!) is out of sync.  The queue
> master stops.
> 
> - The node finishes shutting down.
> 
> - Eventually some combination of the partition handling code and the
> resource agent gets the cluster back up.  HOWEVER!  When the cluster is back
> up, the queue master process(es) are still not running.  Why exactly, I am
> not sure.  This is probably a bug due to some race condition.
> 
> - Any operation on the queue from now on will result in a timeout waiting on
> the non-existent queue master
> 
> - Migrations fail because necessary queues are stuck.

Thanks John. I tested by removing the 'pcs resource update rabbitmq set_policy' line and I got better results - 3/4 successful attempts. The failing one was a bit different from the initial report, as the instance didn't get stuck in MIGRATING anymore but eventually went back to ACTIVE state without being migrated from the host. Messages like these can be observed in the nova compute logs:
http://paste.openstack.org/show/607855/

Comment 42 John Eckersberg 2017-04-25 17:59:49 UTC
Brief update on where things stand:

- We're going to revert back to using "ha-mode all" as the rabbitmq policy.  This does not 100% avoid the problem but in theory and in observation it certainly helps quite a bit.

- There is still evidence of something odd going on with the network.  This will require more investigation, probably involving extensive packet captures and analysis.

- Possibly related to the network issue, galera is exhibiting odd behavior where an extra IP address is appearing and then disappearing as a member of the galera cluster.  This address is allocated as a virtual IP in the deployment, and should not be part of the galera cluster.  We're looking into this more.

Comment 43 John Eckersberg 2017-04-25 21:56:15 UTC
OK, we've figured this out.

The problem is that the VIPs are on the same network as the "normal"
controller interfaces, and it causes problems with routing when a VIP
moves.

Consider controller ctrl-r00-00.  It has IPv6 address
fd00:fd00:fd00:2000::20/64 on vlan200, which is the internalapi
network.

The internalapi VIP has IPv6 address fd00:fd00:fd00:2000::5, also on
vlan200, the internalapi network.

Now, when the VIP resource is active on ctrl-r00-00, the vlan200
interface has both addresses.  There also exists a route to
fd00:fd00:fd00:2000::/64 on dev vlan200.

What happens when galera starts, and the replication connects to the
other cluster members?  The configuration item used is:

wsrep_cluster_address=gcomm://ctrl-r00-00,ctrl-r01-01,ctrl-r02-02

(this is an attribute on the galera resource)

Next, ctrl-r00-00 does a name lookup for ctrl-r01-01, which is:

[root@ctrl-r00-00 ~]# getent hosts ctrl-r01-01
fd00:fd00:fd00:2000::21 ctrl-r01-01.redhat.local ctrl-r01-01 

So we need to make a connection to fd00:fd00:fd00:2000::21.  Remember
we have a route for that network on dev vlan200, but we have two valid
IPv6 addresses on that network.

In short, the kernel chooses the VIP address as the source addr for
the connection.  And that's a problem.  When the VIP moves to a
different host, any outbound packets become unroutable.

So in this circumstance, ctrl-r01-01 and ctrl-r02-02 will see the
galera membership of ctrl-r00-00 as originating from the VIP, not from
the address normally associated with ctrl-r00-00.  So when the VIP
moves, galera logs that a member with the VIP address went away.  This
is the suspicious behavior we observed previously.

The same thing happens with RabbitMQ.  Some of the inter-cluster
connections are using the VIP as a source address.  When the VIP
moves, RabbitMQ sees other cluster members disappear, and things
generally fall apart from there.
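
As a rough diagnostic sketch of the above, the source address the kernel would pick for such a connection can be inspected on the node currently holding the VIP, for example:

    [root@ctrl-r00-00 ~]# ip -6 route get fd00:fd00:fd00:2000::21

In the situation described here the "src" field of the output is expected to show the VIP (fd00:fd00:fd00:2000::5) rather than the node's own fd00:fd00:fd00:2000::20 address.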

Comment 44 Sofer Athlan-Guyot 2017-04-26 08:15:49 UTC
Hi,

just adding that it's an IPv6-only problem because VIPs on IPv4 are /32 and thus never the preferred source address for connections to the other nodes.

Comment 45 Sofer Athlan-Guyot 2017-04-26 08:21:51 UTC
We tried adding the VIP as a /128 and it got refused by the kernel.

Comment 46 Fabio Massimo Di Nitto 2017-04-26 08:23:37 UTC
(In reply to Sofer Athlan-Guyot from comment #45)
> We tried adding the VIP as a /128 and it got refused by the kernel.

Can you please provide the error message? I can add a VIP with /128 without any issue.

Comment 48 Sofer Athlan-Guyot 2017-04-26 09:49:27 UTC
Hi,

(In reply to Fabio Massimo Di Nitto from comment #46)
> (In reply to Sofer Athlan-Guyot from comment #45)
> > We tried adding the VIP as a /128 and it got refused by the kernel.
> 
> Can you please provide an error message? I can add VIP with 128 without any
> issue.

So the /128 error was due to the fact that the IPaddr2 resource agent cannot find the interface when we use IPv6 (while it can when we use IPv4); the problem is described in https://bugzilla.redhat.com/show_bug.cgi?id=1445628

To be more precise, here is what is not working:

    pcs resource update ip-fd00.fd00.fd00.2000..5 ip=fd00:fd00:fd00:2000::5 cidr_netmask=128 

* ip-fd00.fd00.fd00.2000..5_start_0 on ctrl-r01-01 'unknown error' (1): call=166, status=complete, exitreason='Unable to find nic, or netmask mismatch.',
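
One possible way around the nic detection (a sketch only, assuming the VIP lives on the vlan200 interface as in comment 43) would be to pass the interface explicitly via the IPaddr2 nic parameter:

    pcs resource update ip-fd00.fd00.fd00.2000..5 ip=fd00:fd00:fd00:2000::5 cidr_netmask=128 nic=vlan200

Whether the /128 IPv6 VIP then behaves as expected was not verified here.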

Comment 49 Fabio Massimo Di Nitto 2017-04-26 10:06:22 UTC
(In reply to Sofer Athlan-Guyot from comment #48)
> Hi,
> 
> (In reply to Fabio Massimo Di Nitto from comment #46)
> > (In reply to Sofer Athlan-Guyot from comment #45)
> > > We tried adding the VIP as a /128 and it got refused by the kernel.
> > 
> > Can you please provide an error message? I can add VIP with 128 without any
> > issue.
> 
> So the /128 error was due to the fact that IPaddr2 resource agent cannot
> find the interface when we use ipv6 (while it can when we use ipv4) and the
> problem is described there
> https://bugzilla.redhat.com/show_bug.cgi?id=1445628
> 
> To be more precise here what is not working:
> 
>     pcs resource update ip-fd00.fd00.fd00.2000..5 ip=fd00:fd00:fd00:2000::5
> cidr_netmask=128 
> 
> * ip-fd00.fd00.fd00.2000..5_start_0 on ctrl-r01-01 'unknown error' (1):
> call=166, status=complete, exitreason='Unable to find nic, or netmask
> mismatch.',

OK, this is not the kernel rejecting the route, but the resource agent, because it doesn't have enough information to determine where to install the VIP.

Comment 50 Michele Baldessari 2017-04-26 13:56:13 UTC
Current status:
a) As a recap: we have identified the "network outage" as being due to rabbitmq/galera binding the source side of their sockets to a VIP. Any subsequent move of the VIP results in the rabbit/galera node being disconnected from the rabbit/galera cluster for a short period of time. Note: all services are potentially affected and will have one or more TCP connections hung.

When adding the IPv6 VIP, a new route in the routing table gets created. This has the longest prefix and is chosen. Using "noprefixroute" solves this when we manually add the address.
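
A minimal sketch of that manual test, assuming the VIP and interface names from comment 43:

    ip -6 addr add fd00:fd00:fd00:2000::5/64 dev vlan200 noprefixroute

With noprefixroute the kernel does not install the extra route for the VIP, so the VIP stops being picked as the preferred source address.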

In IPv4, no entry is created by default in the routing table, so the VIP is never chosen as a source address when a rabbit/galera node connects to one of its peers.

b) We have two reviews up to revert to the rabbitmq "ha-mode: all" policy, which should make rabbitmq more robust against this failure:
1. https://review.openstack.org/#/c/459994/ (puppet-tripleo)
2. https://review.openstack.org/#/c/459998/ (tripleo-heat-templates)

c) We'd like another round of testing with the above patches included so that we can be reasonably certain that rabbitmq is more resilient in the face of a short network outage. (We want this because it might very well be that OSP10 did not trigger an haproxy restart and VIP relocation). Marius can you take care of this and report back?

If we can reasonably confirm that the reviews at b) do not exhibit the issue anymore, we propose to pull them downstream even before upstream merges (upstream needs to land master first and then ocata, which depending on CI weather might take a few days).

In the meantime we will keep working on a solution regarding the ipv6 VIP route/netmask issue.

Comment 51 Marius Cornea 2017-04-26 14:35:51 UTC
(In reply to Michele Baldessari from comment #50)
> c) We'd like another round of testing with the above patches included so
> that we can be reasonably certain that rabbitmq is more resilient in the
> face of a short network outage. (We want this because it might very well be
> that OSP10 did not trigger an haproxy restart and VIP relocation). Marius
> can you take care of this and report back?
> 

Yes, I'm currently running the upgrades with the 2 patches applied. I'll get back with the results once they finish.

Comment 52 Michele Baldessari 2017-04-26 16:15:35 UTC
Marius, you can use:
https://review.openstack.org/#/c/460204/ (puppet-tripleo)
https://review.openstack.org/#/c/460202/ (tht)

those are the ocata backports already (since master for tht deviated from downstream a bit and the patch did not apply cleanly)

Comment 53 Michele Baldessari 2017-04-26 18:18:43 UTC
Let's keep this bug for tracking the rabbitmq ha-mode changes. Let's use https://bugzilla.redhat.com/show_bug.cgi?id=1445905 to track the ipv6 VIP issue.

Comment 54 Marius Cornea 2017-04-26 20:28:41 UTC
(In reply to Michele Baldessari from comment #52)
> Marius, you can use:
> https://review.openstack.org/#/c/460204/ (puppet-tripleo)
> https://review.openstack.org/#/c/460202/ (tht)
> 
> those are the ocata backports already (since master for tht deviated from
> downstream a bit and the patch did not apply cleanly)

I tested on environments including these patches but didn't get good results - live migration failed/got stuck on 3/5 environments.

Comment 55 Alexander Chuzhoy 2017-04-27 01:05:15 UTC
Reproduced the issue,
Environment:
openstack-puppet-modules-10.0.0-1.el7ost.noarch
instack-undercloud-6.0.0-6.el7ost.noarch
openstack-tripleo-heat-templates-6.0.0-7.el7ost.noarch
openstack-nova-cert-14.0.4-4.el7ost.noarch
openstack-nova-scheduler-14.0.4-4.el7ost.noarch
puppet-nova-9.5.0-3.el7ost.noarch
python-nova-14.0.4-4.el7ost.noarch
python-novaclient-6.0.0-1.el7ost.noarch
openstack-nova-compute-14.0.4-4.el7ost.noarch
openstack-nova-novncproxy-14.0.4-4.el7ost.noarch
openstack-nova-console-14.0.4-4.el7ost.noarch
openstack-nova-common-14.0.4-4.el7ost.noarch
openstack-nova-api-14.0.4-4.el7ost.noarch
openstack-nova-conductor-14.0.4-4.el7ost.noarch


[stack@director ~]$ nova list --all

+--------------------------------------+-------------+----------------------------------+--------+------------+-------------+------------------------------------------+
| ID                                   | Name        | Tenant ID                        | Status | Task State | Power State | Networks                                 |
+--------------------------------------+-------------+----------------------------------+--------+------------+-------------+------------------------------------------+
| d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 | cirros_test | 95eb1ca2fd094344a41cddc73e46dbca | ACTIVE | -          | Running     | tenant_net=192.168.201.10, 192.168.191.7 |
+--------------------------------------+-------------+----------------------------------+--------+------------+-------------+------------------------------------------+
[stack@director ~]$ nova live-migration d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 overcloud-compute-2
ERROR (ClientException): Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.ComputeHostNotFound'> (HTTP 500) (Request-ID: req-3cbbd7d7-98d0-4b9f-9956-f62b453a3377)
[stack@director ~]$ . overcloudrc^C
[stack@director ~]$ nova show d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9|grep hyper

| OS-EXT-SRV-ATTR:hypervisor_hostname  | overcloud-compute-0.cwdg720xd01.org                                              |
[stack@director ~]$ nova live-migration d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 overcloud-compute-2.cwdg720xd01.org

ERROR (Conflict): Cannot 'os-migrateLive' instance d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 while it is in task_state migrating (HTTP 409) (Request-ID: req-a954dcdc-4587-4b4b-bd5f-c1d9513cfc4d)
[stack@director ~]$ nova show d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9|grep hyper

+--------------------------------------+-------------+----------------------------------+-----------+------------+-------------+------------------------------------------+
| ID                                   | Name        | Tenant ID                        | Status    | Task State | Power State | Networks                                 |
+--------------------------------------+-------------+----------------------------------+-----------+------------+-------------+------------------------------------------+
| d4a6e697-e5d7-4e28-9e4c-c3ee2a961db9 | cirros_test | 95eb1ca2fd094344a41cddc73e46dbca | MIGRATING | migrating  | Running     | tenant_net=192.168.201.10, 192.168.191.7 |
+--------------------------------------+-------------+----------------------------------+-----------+------------+-------------+------------------------------------------+


Restarted rabbitmq on the controllers - didn't help.
Restarted openstack-nova-compute.service on the compute from which I tried to migrate the instance - the instance returned to ACTIVE.
Launched the nova live-migration command again - and it succeeded.

Comment 56 Fabio Massimo Di Nitto 2017-04-27 03:10:30 UTC
(In reply to Alexander Chuzhoy from comment #55)

> Restarted rabbitmq on controllers - didn't help.
> restarted openstack-nova-compute.service on the compute from which I tried
> to migrate the instance - the instance returned to be active.
> Launched the nova live-migration command again - and it succeeded.

So there is a workaround that we can add to the Release Notes for this specific situation.
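
Summarized, the workaround amounts to restarting the compute service on the source compute node and then retrying the migration, roughly (commands as used earlier in this bug):

    systemctl restart openstack-nova-compute.service      # on the source compute node
    nova live-migration <instance-uuid> <target-compute>  # from the client, once the instance is ACTIVE again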

Comment 58 Alexander Chuzhoy 2017-04-27 13:15:36 UTC
Some details re setup in comment #55:
This was during Upgrade from OSP10 to OSP11 on a setup with IPv4.

Comment 59 Michele Baldessari 2017-04-27 13:46:26 UTC
(In reply to Alexander Chuzhoy from comment #58)
> Some details re setup in comment #55:
> This was during Upgrade from OSP10 to OSP11 on a setup with IPv4.

So I took a very quick peek at an env that had the issue (even though it seems to be currently in an odd state, i.e. I can't reach controller-0 and controller-1 from the undercloud). I would say that this problem is rather different from what is being tracked in this bug here:
[root@overcloud-controller-2 ~]# grep -C2 partition /var/log/rabbitmq/rabbit\@overcloud-controller-2.log 
[root@overcloud-controller-2 ~]# grep -C2 partial  /var/log/rabbitmq/rabbit\@overcloud-controller-2.log 

There are no signs of partitions like we observed with ipv6. So I'd say it is best if we open a new bug for what you are observing. Also, yum updated packages at around Apr 26 20:20:27 and the only etimedout rabbitmq logs are around:
=ERROR REPORT==== 26-Apr-2017::00:02:48 ===
closing AMQP connection <0.3086.0> (192.168.140.21:34974 -> 192.168.140.22:5672):
{inet_error,etimedout}

So that is 19 hours before. If you can reproduce this, it is best that we track the problem in a separate bug.

Comment 64 Udi Shkalim 2017-05-07 13:29:36 UTC
Setup: 3 controllers, 2 compute, 3 ceph - IPv6
OSP10 upgrade to OSP11

Before upgrading the compute nodes I ran host-evacuate-live twice and the instance was ACTIVE and evacuated successfully.

Packages:
openstack-tripleo-heat-templates-6.0.0-10.el7ost.noarch
puppet-tripleo-6.3.0-12.el7ost.noarch

Policy changed:
Resource: rabbitmq (class=ocf provider=heartbeat type=rabbitmq-cluster)
  Attributes: set_policy="ha-all ^(?!amq\.).* {"ha-mode":"all"}"
  Meta Attrs: notify=true 
  Operations: monitor interval=10 timeout=40 (rabbitmq-monitor-interval-10)
              start interval=0s timeout=200s (rabbitmq-start-interval-0s)
              stop interval=0s timeout=200s (rabbitmq-stop-interval-0s)
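
The effective policy can also be cross-checked directly on one of the controllers, for example with:

    rabbitmqctl list_policies

which should list the ha-all policy with "ha-mode":"all" on the '/' vhost.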

Comment 65 errata-xmlrpc 2017-05-17 20:20:24 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1245